CloudWatch Agent high CPU usage

amazon-cloudwatch

We are using AWS CloudWatch to monitor CPU usage, p99 latency for API calls, and so on. The problem is that during peak traffic the Amazon CloudWatch Agent itself consumes 25%-35% CPU, which contributes significantly to tripping our high-CPU alarm. I have also observed a direct correlation between the p99 latency metrics and the CPU usage metrics.

  1. Is it normal for monitoring tools to be this hard on system resources?
  2. Is there a way to tune the Amazon CloudWatch Agent so that it uses fewer system resources? (I have sketched one idea after the config below.)

I'm pasting the Amazon CloudWatch Agent config file here:

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  logfile = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  logtarget = "lumberjack"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.cpu]]
    fieldpass = ["usage_active"]
    interval = "10s"
    percpu = true
    report_active = true
    totalcpu = false
    [inputs.cpu.tags]
      "aws:StorageResolution" = "true"
      metricPath = "metrics"

  [[inputs.disk]]
    fieldpass = ["total", "used"]
    interval = "60s"
    mount_points = ["/", "/tmp"]
    tagexclude = ["mode"]
    [inputs.disk.tags]
      metricPath = "metrics"

  [[inputs.logfile]]
    destination = "cloudwatchlogs"
    file_state_folder = "/opt/aws/amazon-cloudwatch-agent/logs/state"

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/access-logs-app2/app.log.*"
      from_beginning = true
      log_group_name = "access-logs-app2"
      log_stream_name = "access-logs-app2"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/webhooks-logs-app2/webhook.log.*"
      from_beginning = true
      log_group_name = "webhooks-logs-app2"
      log_stream_name = "webhooks-logs-app2"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/access-logs-app/app.log.*"
      from_beginning = true
      log_group_name = "access-logs-app"
      log_stream_name = "access-logs-app"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/webhooks-logs-app/webhook.log.*"
      from_beginning = true
      log_group_name = "webhooks-logs-app"
      log_stream_name = "webhooks-logs-app"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/query-logs/**"
      from_beginning = true
      log_group_name = "db-query-logs"
      log_stream_name = "db-query-logs"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/var/log/nginx/some_name.*"
      from_beginning = true
      log_group_name = "some_name-nginx"
      log_stream_name = "some_name-nginx"
      pipe = false
    [inputs.logfile.tags]
      metricPath = "logs"

  [[inputs.mem]]
    fieldpass = ["used", "cached", "total"]
    interval = "60s"
    [inputs.mem.tags]
      metricPath = "metrics"

[outputs]

  [[outputs.cloudwatch]]
    force_flush_interval = "60s"
    namespace = "CWAgent"
    profile = "www-data"
    region = "ap-south-1"
    shared_credential_file = "/var/.aws/credentials"
    tagexclude = ["metricPath"]
    [outputs.cloudwatch.tagpass]
      metricPath = ["metrics"]

  [[outputs.cloudwatchlogs]]
    force_flush_interval = "5s"
    log_stream_name = "production"
    profile = "www-data"
    region = "ap-south-1"
    shared_credential_file = "/var/.aws/credentials"
    tagexclude = ["metricPath"]
    [outputs.cloudwatchlogs.tagpass]
      metricPath = ["logs"]
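
To make question 2 concrete: from what I can tell, the heaviest pieces above are the 10-second per-core CPU collection (published as high-resolution metrics because of the aws:StorageResolution tag), the 1-second flush interval, and the six tailed log paths. A leaner metrics section of the agent's JSON config (the file this TOML is generated from) might look like the sketch below; the 60-second intervals, aggregated-CPU setting, and measurement names are my assumptions about what we could live with, not tested values, and the logs section would be carried over unchanged.

# Sketch only: a leaner metrics section for the agent's JSON config.
# Assumes 60-second, whole-instance metrics are enough (no high-resolution,
# no per-core series); the existing logs section is left out here and would
# be kept as it is today.
sudo tee /opt/aws/amazon-cloudwatch-agent/bin/config.json >/dev/null <<'EOF'
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_active"],
        "metrics_collection_interval": 60,
        "totalcpu": true
      },
      "disk": {
        "measurement": ["total", "used"],
        "resources": ["/", "/tmp"],
        "metrics_collection_interval": 60
      },
      "mem": {
        "measurement": ["mem_used", "mem_cached", "mem_total"],
        "metrics_collection_interval": 60
      }
    }
  }
}
EOF

# Regenerate the running TOML from the JSON above and restart the agent.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json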

Best Answer

I have the same question, and you've answered it for me. I run a mail server, a DNS server, and a web server (the front end to a separate RDS database instance). I used to run all of it on a single t2.nano instance (not a CPU powerhouse!) without breaking a sweat, keeping the CPU credit balance pegged at 72 without it ever dipping.

Then I added the following four lines to a cron job that ran every minute (each line uses a different metric name):

aws cloudwatch ... --value $(($(df --output=avail /     | tail -1)*1024))
aws cloudwatch ... --value $(($(df --output=avail /home | tail -1)*1024))
aws cloudwatch ... --value $(free -b | sed -r  's:Mem([^0-9]*([0-9]*)){6}.*:\2:p;d')
aws cloudwatch ... --value $(free -b | sed -r 's:Swap([^0-9]*([0-9]*)){2}.*:\2:p;d')
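
Spelled out, each of those elided statements was a single-metric put-metric-data call along these lines (the namespace, metric name, and dimension below are placeholders borrowed from the combined example further down, not necessarily what I ran):

# One of the four, written out in full -- the names here are placeholders.
aws cloudwatch put-metric-data \
    --namespace MySpace \
    --metric-name "EC2 root" \
    --dimensions Instance=i-instance-id \
    --unit Bytes \
    --value $(($(df --output=avail / | tail -1)*1024))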

That resulted in a continuous drain on my CPU credit balance, so I changed the cron interval to five minutes, which stabilized the balance with no apparent further decrease or increase. That's ridiculous!

The eventual cure? I figured it was time to upgrade to a t3.nano instance (two vCPUs rather than one), which I did. Now, with the replacement cron job (below) running every minute, the instance accumulates CPU credits at a rate of 5 per hour. Working the math for the original cron job running every minute, that comes out to roughly 0.4 CPU credits per hour per aws cloudwatch statement.

It appears that you can send multiple metrics in a single aws cloudwatch statement, and it executes in about the same time as one of the single-metric statements above:

{ cat <<EOF
[
 {
  "MetricName": "EC2 root",
  "Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
  "Value":      $(($(df --output=avail /     | tail -1)*1024)),
  "Unit":       "Bytes"
 },
 {
  "MetricName": "EC2 home",
  "Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
  "Value":      $(($(df --output=avail /home | tail -1)*1024)),
  "Unit":       "Bytes"
 },
 {
  "MetricName": "EC2 free",
  "Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
  "Value":      $(free -b | sed -r  's:Mem([^0-9]*([0-9]*)){6}.*:\2:p;d'),
  "Unit":       "Bytes"
 },
 {
  "MetricName": "EC2 swap",
  "Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
  "Value":      $(free -b | sed -r 's:Swap([^0-9]*([0-9]*)){2}.*:\2:p;d'),
  "Unit":       "Bytes"
 }
]
EOF
} | aws cloudwatch put-metric-data --namespace MySpace --metric-data file:///dev/stdin

[Note the use of heredoc syntax, which lets the shell expressions be evaluated inside what is otherwise a "text" file.]
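
For anyone wiring this up the same way, the block above just lives in a small script that cron calls once a minute; the file names below are placeholders:

# /etc/cron.d/put-metrics -- placeholder paths; the script contains the block above
* * * * * root /usr/local/bin/put-combined-metrics.sh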

Who knows what the CloudWatch Agent is doing. I came here looking to see whether running the CloudWatch Agent would be more efficient than issuing individual aws cloudwatch statements. Apparently not.