The "docker stats" command provides some basic info about containers. For example:
CONTAINER ID   NAME             CPU %   MEM USAGE / LIMIT     MEM %   NET I/O         BLOCK I/O        PIDS
2c73c2e10c53   container_name   0.46%   1.422GiB / 31.39GiB   4.53%   350MB / 227MB   534MB / 1.42GB   63
I'm collecting this information with a program ( https://github.com/nagylzs/pysysinfo_influxdb ) and sending it to an InfluxDB database. I need to run performance tests on multiple servers with this setup and analyze the results: find possible bottlenecks on all servers (CPU, network speed, memory, disk I/O etc.). The information provided by "docker stats" is very coarse. Here are the problems:
- Measurements are taken every 30 seconds. The block read/write and network read/write values are cumulative, so they keep increasing. Once they reach 1GB, they become useless for performance analysis, because the displayed precision is too coarse. For example, if I measure 29.1GB now, and 29.2GB 30 seconds later, then the actual amount could be anything between 1MB and 149MB.
- I would like to know the busy_time % (time spent on I/O) value. The raw block read/write values do not tell much unless I know the maximum. But since I need to monitor several servers, and their maximum performance varies, the raw block values cannot be (easily) used to identify bottlenecks.
I also tried to get this info with "docker inspect", but I do not see any usable values there.
For the network interfaces, I can imagine that this COULD work (although it is difficult to implement):
- list the network interfaces for each container, using "docker inspect"
- then go over the output of "ifconfig" and collect the "RX bytes" and "TX bytes" values
- collecting this info every 30s becomes a problem of its own
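The steps above can be sketched without parsing "ifconfig" output, by reading the kernel's per-interface byte counters from /proc/net/dev instead. This is only a minimal sketch under some assumptions: the function names are made up for illustration, and the file read must belong to the container's network namespace (for example /proc/PID/net/dev, where PID is the container's main process as reported by "docker inspect"). The counters are exact byte counts, so they do not suffer from the rounding problem of "docker stats".

```python
def parse_net_dev(text):
    """Parse the content of /proc/net/dev into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    # The first two lines are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes; the receive side has 8 columns,
        # so field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def read_net_dev(path="/proc/net/dev"):
    """Read counters from the given file (hypothetical helper).

    Pass "/proc/<pid>/net/dev" to look into the network namespace
    of the process with that PID.
    """
    with open(path) as f:
        return parse_net_dev(f.read())
```

Calling read_net_dev() twice, 30 seconds apart, and subtracting the values gives the exact number of bytes transferred in that interval.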
How can I do the same for disk I/O? How can I get the I/O busy time? This info must be available, because "docker stats" displays it, just not in the best format. Any ideas?
Best Answer
Docker containers are based on Linux cgroups, so you can read the metrics from the cgroup files. For example, see the cgroup v1 doc for the Block IO Controller: https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt
Example from my OS:
Doc for blkio.throttle.io_service_bytes file:
Don't expect any nice % metric values. The files contain only counters, so you have to calculate the % from the counter values yourself. Just find the counters that are useful for you (I guess the *wait* metrics) and you will be able to detect an IO bottleneck.
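As a rough illustration of calculating rates and percentages from these counters, here is a minimal sketch. It assumes the per-operation line format described in the cgroup v1 blkio doc ("major:minor Operation value", followed by a final "Total value" line). The function names are invented for this example, and the location of the files depends on the system (on cgroup v1 with the cgroupfs driver it is often under /sys/fs/cgroup/blkio/docker/, in a directory named after the container ID, but treat that as an assumption).

```python
def parse_blkio(text):
    """Parse a per-operation blkio file into {(device, operation): value}."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-device lines look like "8:0 Read 1234". The trailing
        # "Total 1234" summary line has only two fields and is skipped.
        if len(parts) == 3:
            device, operation, value = parts
            counters[(device, operation)] = int(value)
    return counters


def rates(previous, current, interval_seconds):
    """Per-second rate of each counter between two samples."""
    result = {}
    for key, value in current.items():
        if key not in previous:
            continue  # new device, no baseline yet
        delta = value - previous[key]
        # A negative delta means the counter was reset (e.g. container restart).
        if delta >= 0:
            result[key] = delta / interval_seconds
    return result


def busy_percent(previous_ns, current_ns, interval_seconds):
    """Share of the interval covered by a nanosecond time counter, in percent."""
    return (current_ns - previous_ns) / (interval_seconds * 1e9) * 100.0
```

With two samples of blkio.throttle.io_service_bytes taken 30 seconds apart, rates() gives exact bytes per second. busy_percent() can be fed the deltas of a time counter such as blkio.io_service_time or blkio.io_wait_time (documented in nanoseconds). Note that with several requests in flight at once the result may exceed 100%, so it is an indicator rather than a true utilization figure.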
You can apply similar concept also for: