Docker performance monitoring – how to query I/O busy time and exact block reads/writes

docker performance-monitoring

The "docker stats" command provides some basic info about containers. For example:

CONTAINER ID        NAME                                                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
2c73c2e10c53        container_name                                      0.46%               1.422GiB / 31.39GiB   4.53%               350MB / 227MB       534MB / 1.42GB      63

I'm collecting this information with a program ( https://github.com/nagylzs/pysysinfo_influxdb ) and sending it to an InfluxDB database. I need to run performance tests on multiple servers with this setup and analyze the results to find possible bottlenecks on all servers (CPU, network speed, memory, disk I/O, etc.). The information provided by "docker stats" is too coarse for this. Here are the problems:

  • Measurements are taken every 30 seconds. The block read/write and network read/write values are cumulative, so they keep increasing. Once they reach 1GB, they become useless for performance analysis, because the displayed precision is too coarse. For example, if I measure 29.1GB now and 29.2GB 30 seconds later, then the actual amount could be anything between 1MB and 149MB.
  • I would like to know the busy_time % value (the percentage of time spent on I/O). The raw block read/write values do not tell much unless I know the maximum. Since I need to monitor several servers whose maximum performance varies, the raw block values cannot (easily) be used to identify bottlenecks.

I also tried to get this info with "docker inspect", but I do not see any usable value there.

For the network interfaces, I can imagine that this COULD work (although it is difficult to implement):

  • list the network interfaces for each container, using "docker inspect"
  • then go over the output of "ifconfig" and collect the "RX bytes" and "TX bytes" values
  • collecting this info every 30s becomes a problem of its own
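
As a rough illustration of the counter-collection step above, here is a minimal sketch that reads exact byte counters from text in the /proc/net/dev format instead of scraping "ifconfig" output. This is a swapped-in alternative, not what the question describes; the function name and the interface name in the sample are made up, and mapping a container to its host-side interface is still assumed to be done separately.

```python
def parse_proc_net_dev(text):
    """Parse /proc/net/dev-style text into {interface: (rx_bytes, tx_bytes)}.

    Each data line is "<iface>: <8 receive fields> <8 transmit fields>";
    the byte counters are the first field of each group.
    """
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 9:
            continue  # malformed or truncated line
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


# Hypothetical sample; on a real host the text would come from /proc/net/dev.
SAMPLE = (
    "Inter-|   Receive                           |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
    "    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n"
    "veth1a2b3c: 350000000 900 0 0 0 0 0 0 227000000 800 0 0 0 0 0 0\n"
)

if __name__ == "__main__":
    for iface, (rx, tx) in parse_proc_net_dev(SAMPLE).items():
        print(iface, rx, tx)
```

Because these are plain byte counters, the difference between two samples is exact regardless of how large the totals have grown.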

How can I do the same for disk I/O? How can I get the I/O busy time? This info must be available, because "docker stats" displays it, just not in the best format. Any ideas?

Best Answer

Docker containers are based on Linux cgroups, so read the metrics from the cgroup files. For example, see the cgroup v1 documentation for the Block IO Controller: https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt

Example from my OS:

[root@dockerhost c2241e5663e04cbf0154b06ecb4fe7f31f67918748c3123ea7104ac8db004dae]# ls
blkio.io_merged                   blkio.io_serviced_recursive      blkio.reset_stats                blkio.throttle.write_bps_device   cgroup.event_control
blkio.io_merged_recursive         blkio.io_service_time            blkio.sectors                    blkio.throttle.write_iops_device  cgroup.procs
blkio.io_queued                   blkio.io_service_time_recursive  blkio.sectors_recursive          blkio.time                        notify_on_release
blkio.io_queued_recursive         blkio.io_wait_time               blkio.throttle.io_service_bytes  blkio.time_recursive              tasks
blkio.io_service_bytes            blkio.io_wait_time_recursive     blkio.throttle.io_serviced       blkio.weight
blkio.io_service_bytes_recursive  blkio.leaf_weight                blkio.throttle.read_bps_device   blkio.weight_device
blkio.io_serviced                 blkio.leaf_weight_device         blkio.throttle.read_iops_device  cgroup.clone_children
[root@dockerhost c2241e5663e04cbf0154b06ecb4fe7f31f67918748c3123ea7104ac8db004dae]# cat blkio.throttle.io_service_bytes
253:4 Read 1540096
253:4 Write 0
253:4 Sync 0
253:4 Async 1540096
253:4 Total 1540096
Total 1540096

Doc for blkio.throttle.io_service_bytes file:

- blkio.throttle.io_service_bytes
    - Number of bytes transferred to/from the disk by the group. These
      are further divided by the type of operation - read or write, sync
      or async. First two fields specify the major and minor number of the
      device, third field specifies the operation type and the fourth field
      specifies the number of bytes.
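
To make the file format concrete, here is a minimal sketch of a parser for it. The function name is hypothetical, and it works on text already read from the file, because the location of the container's cgroup directory depends on the host setup.

```python
def parse_blkio_stat(text):
    """Parse "MAJ:MIN Operation Value" lines into {device: {operation: value}}.

    The trailing aggregate line ("Total N") has no device field and is
    stored under the key "total".
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            device, operation, value = fields
            stats.setdefault(device, {})[operation] = int(value)
        elif len(fields) == 2 and fields[0] == "Total":
            stats.setdefault("total", {})["Total"] = int(fields[1])
    return stats


# The output shown above, used as sample input.
SAMPLE = """253:4 Read 1540096
253:4 Write 0
253:4 Sync 0
253:4 Async 1540096
253:4 Total 1540096
Total 1540096
"""

if __name__ == "__main__":
    stats = parse_blkio_stat(SAMPLE)
    print(stats["253:4"]["Read"], stats["253:4"]["Write"])
```

The values are exact byte counts, so they do not suffer from the rounding that "docker stats" applies to large totals.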

Don't expect any nice % metric values. They are only counters, so you have to calculate the % from the counter values yourself. Just find the counters that are useful for you (I guess the *wait* metrics) and you will be able to detect an I/O bottleneck.
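
A minimal sketch of that calculation, assuming two samples of the same counter taken a known number of seconds apart. The function names are hypothetical. The percentage helper assumes a counter in milliseconds, which is how the linked doc describes blkio.time; whether that file is populated may depend on the I/O scheduler in use. The doc also notes that blkio.io_wait_time is cumulative over all IOs and can exceed the elapsed time, so a percentage derived from it is not capped at 100.

```python
def counter_rate(previous, current, interval_s):
    """Per-second rate of a monotonically increasing counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = current - previous
    if delta < 0:
        # Counter went backwards: assume it was reset (e.g. the container
        # was restarted) and count only what accumulated since the reset.
        delta = current
    return delta / interval_s


def busy_percent(previous_ms, current_ms, interval_s):
    """Share of the interval covered by a millisecond time counter, in %."""
    return 100.0 * counter_rate(previous_ms, current_ms, interval_s) / 1000.0


if __name__ == "__main__":
    # Bytes read grew by 3000 over 30 s -> 100 bytes/s.
    print(counter_rate(1000, 4000, 30))
    # 15000 ms of disk time accumulated over 30 s -> 50 %.
    print(busy_percent(0, 15000, 30))
```

Storing the raw counters and computing the deltas at query time is an equally valid design; the arithmetic is the same.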

You can apply a similar concept to: