Python – Upload to S3 bucket slows over time

Tags: amazon-web-services, aws-cli, python

I'm part way through uploading about 200,000 files (each is ~1MB max) to an S3 bucket from an EC2 instance (both in Europe West).

From monitoring the EC2 with CloudWatch (looking at the NetworkOut metric), there seems to be a drop-off in the upload transfer over time:

[Chart: CloudWatch NetworkOut over time, showing the drop-off]

I'm uploading the files in several tranches and the drop-off seems consistent, usually after four or five hours (but it sometimes occurs more quickly).

The files are uploaded with a Python script, which:

  1. Downloads a .zip from a third party server
  2. Extracts about 25 files from the .zip and gzips each file
  3. Uploads the .gz files to the bucket
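The three steps above can be sketched roughly as follows (the bucket name, URL handling, and key naming are my assumptions, not the asker's actual script; boto3 is imported inside the upload step so the compression helper stands alone):

```python
import gzip
import io
import urllib.request
import zipfile

BUCKET = "my-upload-bucket"  # hypothetical bucket name


def gzip_bytes(data: bytes) -> bytes:
    """Compress a byte string with gzip, entirely in memory."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(data)
    return buf.getvalue()


def process_archive(zip_url: str) -> None:
    """Download a .zip, gzip each member, and upload the .gz files to S3."""
    import boto3  # deferred so the rest of the sketch has no AWS dependency

    s3 = boto3.client("s3")
    # 1. Download the .zip from the third-party server
    with urllib.request.urlopen(zip_url) as resp:
        payload = resp.read()
    # 2. Extract each member of the archive and gzip it
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        for name in zf.namelist():
            compressed = gzip_bytes(zf.read(name))
            # 3. Upload the gzipped file to the bucket
            s3.upload_fileobj(io.BytesIO(compressed), BUCKET, name + ".gz")
```

Note that both the extraction and the gzip step here are CPU-bound, which matters for the diagnosis below.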

I've tried two ways of uploading the .gz files…

  • Sequentially, using boto3: boto3.client("s3").upload_file("file.gz", bucket, "file.gz")
  • Running the AWS CLI as a subprocess to upload 25 .gz files at a time

…but I saw the same drop-off with each method.
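The two approaches can be sketched like this (the bucket name and directory layout are assumptions; the CLI flags batch all the .gz files into one `aws s3 cp` call):

```python
import subprocess
from pathlib import Path


def upload_sequentially(local_dir: str, bucket: str) -> None:
    """Method 1: upload each .gz file with boto3, one at a time."""
    import boto3  # deferred import; sketch only

    s3 = boto3.client("s3")
    for path in sorted(Path(local_dir).glob("*.gz")):
        s3.upload_file(str(path), bucket, path.name)


def cli_upload_command(local_dir: str, bucket: str) -> list:
    """Build the AWS CLI command for method 2 (batch upload of .gz files)."""
    return [
        "aws", "s3", "cp", local_dir, f"s3://{bucket}/",
        "--recursive", "--exclude", "*", "--include", "*.gz",
    ]


def upload_with_cli(local_dir: str, bucket: str) -> None:
    """Method 2: shell out to the AWS CLI to upload the whole batch."""
    subprocess.run(cli_upload_command(local_dir, bucket), check=True)
```

Since both paths hit the same bottleneck, the cause is unlikely to be the upload mechanism itself.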

What could be causing this? Or what information should I collect to debug it?

Edit

Here's a chart for the same period, showing the BurstBalance metric (the EC2 instance is a t2.small):

[Chart: BurstBalance for the same period]

Here's CPUCreditBalance:

[Chart: CPUCreditBalance for the same period]

Best Answer

My best guess is it's your EBS I/O credits. Monitor this with the BurstBalance CloudWatch metric. Please check, post a graph, and if it's not that I'll think some more.

Update: the third graph you added shows that you've run out of CPU credits, so your CPU is being throttled. You can either accept the slower performance or temporarily change to a more suitable instance type.

This workload looks quite CPU intensive. You could move to a t2.large and get four times the CPU allowance, or I'd probably move to a general-purpose m4 instance for a while. Changing instance type is easy: stop the instance, right-click it in the console, choose Change Instance Type, then start it again.
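If you want to watch these credit metrics without the console, you could query CloudWatch with boto3 (the instance ID below is a placeholder; BurstBalance lives in the `AWS/EBS` namespace keyed by `VolumeId`, while CPUCreditBalance is in `AWS/EC2` keyed by `InstanceId`):

```python
import datetime


def credit_metric_query(instance_id: str, hours: int = 6) -> dict:
    """Build a get_metric_statistics request for CPUCreditBalance."""
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=hours)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": 300,  # one datapoint per 5 minutes
        "Statistics": ["Average"],
    }


def fetch_cpu_credit_balance(instance_id: str) -> list:
    """Fetch recent CPUCreditBalance datapoints, oldest first."""
    import boto3  # deferred import; sketch only

    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.get_metric_statistics(**credit_metric_query(instance_id))
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```

A balance trending toward zero, as in your third chart, is the signature of a throttled burstable instance.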