What’s the fastest way of getting data into an AWS Lambda function?

amazon-s3, amazon-lambda, amazon-web-services, bandwidth

I have something I'll call a "microservice" running on AWS Lambda (using node.js).

Basically it serves up condensed summaries drawn from a binary blob of a few hundred megabytes. There are a lot of possible outputs, so pre-generating all possibilities isn't an option, and it needs to be reasonably responsive (sub-second at worst, say) because it's accessed (via API Gateway) from interactive webpages which allow parameters to be changed rapidly. Access patterns in the blob are essentially random, although any single summary will typically only touch ~0.1-1% of the total data. The data and access patterns aren't very compatible with storing the data in a database (although see the mention of DynamoDB below).

My current approach is to host the big binary blob on S3 and have the Lambda handlers cache the blob locally between Lambda invocations (just as a buffer in the javascript code, scoped outside the handler function; the Lambda's memory is of course configured sufficiently large). Handler instances seem to be persistent enough that, once an instance is up and running, it works well and is very responsive. However, there are at least a couple of downsides:
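
For concreteness, the caching pattern looks roughly like the sketch below (the bucket, key and summarize() call are placeholders for the real code); the key point is that the buffer lives in module scope, outside the handler, so it survives between invocations handled by the same container:

    // Rough sketch of the current approach (bucket, key and summarize()
    // are placeholders). The blob is fetched from S3 once per container
    // and reused by every subsequent invocation routed to that container.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    let blob = null; // module scope: survives between invocations

    function getBlob() {
      if (blob) return Promise.resolve(blob);
      return s3.getObject({
        Bucket: 'my-data-bucket',   // placeholder
        Key: 'big-binary-blob.bin'  // placeholder
      }).promise().then(res => {
        blob = res.Body;            // Buffer of a few hundred MB
        return blob;
      });
    }

    exports.handler = (event, context, callback) => {
      getBlob()
        .then(data => callback(null, summarize(data, event.params)))
        .catch(callback);
    };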

  • The initial fetch of the data from S3 runs at around 50-60 MByte/s
    (consistent with other reports of S3 bandwidth I've seen), so there
    can be an annoying multi-second delay on first access.

  • Related to the previous point, if the client is very active and/or
    user load increases, more Lambda instances get spun up, and users may
    find their requests routed to instances that are still stalled
    fetching the data blob, which leads to annoying glitches in an
    otherwise smoothly functioning client.

I freely admit I'm probably expecting too much from what's really intended to be a "stateless" service by giving it a big chunk of state (the binary blob), but I'm wondering if anything can be done to improve the situation. Note that the data is not particularly compressible (it might be possible to shave a third off it, but that's not the order-of-magnitude improvement I'm looking for; at best it's only part of a solution).

Any suggestions on how to get the data into the Lambda faster? The sort of thing I'm imagining is:

  • Pull your data from somewhere else that Lambdas have much higher bandwidth to… but what? DynamoDB (split into as many 400KB binary records as needed; see the sketch after this list)? ElastiCache? Something else on the AWS "menu" I haven't noticed?

  • Use some cunning trick (what?) to "pre-warm" lambda instances.

  • You're using completely the wrong tool for the job; use… instead? (I do really like the Lambda model though; no need to worry about all that instance provisioning and auto-scaling, just concentrate on functionality).

If Google's or Microsoft's recently announced Lambda-alike offerings (about which I know little) have any attributes which would handle this use-case better, that'd be very interesting information too.

One option I have contemplated is baking the binary data into a "deployment package", but the 250MByte limit on that is too low for some anticipated use-cases (even if the blob were compressed).

Best Answer

If the binary blob is only a few hundred megabytes, you can just include it with your function as a "dependency": add it as a file alongside your code and reference it from there.
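
A minimal sketch of that, assuming the blob is shipped inside the deployment package as data/blob.bin (a made-up path); reading it at module load means the cost is paid once per container rather than once per request:

    // Minimal sketch: the blob is packaged alongside the function code and
    // read once when the container initialises (the path is illustrative).
    const fs = require('fs');
    const path = require('path');

    const blob = fs.readFileSync(path.join(__dirname, 'data', 'blob.bin'));

    exports.handler = (event, context, callback) => {
      // summarize() stands in for the real application logic
      callback(null, summarize(blob, event.params));
    };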

Another option would be to have two Lambda functions. The first does nothing but serve up the blob (bundled with the function as described above), and you use a timer (cron, basically) to "tickle" that function every minute to keep it active. Your second Lambda is the one that does the work, and the first thing it does on startup is call the first Lambda to get the blob. Lambda-to-Lambda calls are high bandwidth, so the startup time shouldn't be a problem.
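
One way the hand-off could work within Lambda's synchronous invoke payload cap (currently 6MB per response) is to serve the blob in slices; everything below (the function name "blob-server", the { warm: true } marker sent by the scheduled rule, and the slicing protocol) is made up for illustration:

    // Blob-serving function: the blob is bundled with this function and
    // handed out in base64 slices, since a multi-hundred-MB blob will not
    // fit in a single (6MB-capped) invoke response. A scheduled rule sends
    // { "warm": true } every minute so the container stays resident.
    const fs = require('fs');
    const path = require('path');

    const blob = fs.readFileSync(path.join(__dirname, 'data', 'blob.bin'));
    const SLICE = 4 * 1024 * 1024; // 4MB raw per call; leaves base64 headroom

    exports.handler = (event, context, callback) => {
      if (event && event.warm) {
        return callback(null, 'warmed'); // cron "tickle": do no real work
      }
      const i = event.slice || 0;
      callback(null, {
        slices: Math.ceil(blob.length / SLICE),
        data: blob.slice(i * SLICE, (i + 1) * SLICE).toString('base64')
      });
    };

And the worker function pulls all the slices in parallel at startup and reassembles them:

    // Worker side: fetch every slice from the (hypothetical) blob-server
    // function and concatenate them back into one Buffer.
    const AWS = require('aws-sdk');
    const lambda = new AWS.Lambda();

    function fetchSlice(i) {
      return lambda.invoke({
        FunctionName: 'blob-server',          // hypothetical function name
        Payload: JSON.stringify({ slice: i })
      }).promise().then(res => JSON.parse(res.Payload.toString()));
    }

    function fetchBlob() {
      return fetchSlice(0).then(first => {
        const rest = [];
        for (let i = 1; i < first.slices; i++) rest.push(fetchSlice(i));
        return Promise.all(rest).then(others =>
          Buffer.concat([first].concat(others)
            .map(s => Buffer.from(s.data, 'base64'))));
      });
    }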

The ideal solution would be to figure out a way to summarize the data and store it in DynamoDB, but it sounds like you tried that route and it didn't make sense for you.
