Design Patterns for Serverless, Lambda, DynamoDB, S3

Motivation

We’ve been using AWS load balancers with autoscaling EC2 instances for years now, and they’re great at handling load, but they’re quite a bit of infrastructure to manage (even with Troposphere + CloudFormation). We also have to manage all the data flow and queue processing ourselves: multiple SQS queues, EC2 polling, recording state in databases… More of our code is dedicated to that “plumbing and wiring” than to the actual focus of our application.


We’ve been looking at “serverless” options after the announcement of AWS Lambda, and been following the rapid development of the Serverless framework.  Recently I started working on a proof of concept, re-grooving our cloudy application for a serverless world. So far, I’m really lovin’ it. 


Instead of implementing data flow,  finite state machines, queuing and such in our software, we describe that wiring in terms of AWS “events” that trigger “functions” running on Lambda infrastructure. Instead of “data flow as code” we now have “data flow as configuration”.  


We’re comfortable with AWS services, and for this exercise we want to avoid running any 24×7 EC2 servers, so our Lambda functions interact with S3 object storage, DynamoDB databases, and the Elasticsearch Service for search.  Here are some design patterns that have proven helpful as we’ve come up to speed in this brave new world; they’re pretty generic problem-solving approaches, so they should be applicable to your applications as well.


Sample Application

The application I’m using for this learning exercise, this proof of concept, follows a pretty common pattern. Someone uploads an image to an S3 object store, it gets resized, some data about it is stored in a database, then the data and processed image locations are put into a search engine. A simple query API lets users find images in a variety of sizes. The API is exposed to the public via a responsive Angular web front-end, but that’s a subject for another post.

We upload an image to S3, which fires an event to an “extract” lambda that stores info in a database. A “resize” lambda is triggered to resize the image and store it in S3, then updates the database entry. The database emits an event stream that feeds a “publish” lambda, which checks whether both the resized image and the metadata are present and, if so, injects the info into our search engine.
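
As a rough sketch, that wiring could be expressed in serverless.yml along these lines (the resize function is omitted, the table name and stream ARN are placeholders, and the exact stream-event syntax depends on your Serverless framework version):
functions:
  extract:
    handler: extract.handler
    events:
      - s3:
          bucket: images-in
          event: s3:ObjectCreated:*
  publish:
    handler: publish.handler
    events:
      - stream:
          type: dynamodb
          # placeholder: the stream ARN of the images table
          arn: arn:aws:dynamodb:us-east-1:000000000000:table/images/stream/LABEL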

Below we show some patterns we’ve found useful for doing this. We’re mostly a Python shop, so the code examples are in Python. The Serverless framework configuration is a YAML file, serverless.yml.

S3 Events

S3 can emit events on object creation and deletion. Normally we care most about creation (uploads) and handle it with a lambda that acts on the newly created object.
We can declare the events separately in each Lambda function definition, one per event type, so each event gets its own function module, like:
functions:
  extract:
    handler: extract.handler
    events:
      - s3:
          bucket: images-in
          event: s3:ObjectCreated:*
  nuke:
    handler: nuke.handler
    events:
      - s3:
          bucket: images-in
          event: s3:ObjectRemoved:*
This, in my case, would invoke two Python modules, extract.py and nuke.py, each with its own handler function. Easy enough, but it could be a bit too fine-grained if the two files end up duplicating a lot of code.
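For example, a sketch of what extract.py might contain (extract_metadata() here is a hypothetical placeholder, not a real library call):
import urllib

def extract_metadata(bucket, key):
    """Hypothetical placeholder: fetch the object, pull out its metadata, write it to DynamoDB."""
    pass

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # S3 URL-encodes object keys in the event payload (urllib.parse.unquote_plus on Python 3)
        key = urllib.unquote_plus(record['s3']['object']['key'])
        extract_metadata(bucket, key)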
Instead, we could have one module that processes both (all) event types, perhaps using a different handler function for each Lambda:
functions:
  create:
    handler: s3in.handle_create
    events:
      - s3:
          bucket: images-in
          event: s3:ObjectCreated:*
  delete:
    handler: s3in.handle_delete
    events:
      - s3:
          bucket: images-in
          event: s3:ObjectRemoved:*
Here, we’d have a single s3in.py module with two handler functions, handle_create() and handle_delete(). If both share some code, this reduces repetition.
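A sketch of that shared module might look like this (the _records() helper is just my illustration of where the shared code would live):
def _records(event):
    """Shared helper: yield (bucket, key) for each S3 record in the event."""
    for record in event['Records']:
        yield (record['s3']['bucket']['name'],
               record['s3']['object']['key'])

def handle_create(event, context):
    for bucket, key in _records(event):
        pass  # extract metadata, kick off processing, etc.

def handle_delete(event, context):
    for bucket, key in _records(event):
        pass  # remove the corresponding database and search entries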
Or we could have one module and one handler and let the handler discriminate by examining the event; S3 notifications don’t offer a catch-all event type, so we still list each event type, but point them all at the same handler:
functions:
  extract:
    handler: s3event.handler
    events:
      - s3:
          bucket: images-in
          event: s3:ObjectCreated:*
      - s3:
          bucket: images-in
          event: s3:ObjectRemoved:*
I expect it’s cleaner to let the Lambda event/handler mapping do the discrimination, since it saves the handler from having to inspect each event itself, as it must in this last example.
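For completeness, if we did let the handler discriminate, each S3 record carries an eventName string such as 'ObjectCreated:Put' or 'ObjectRemoved:Delete', and a sketch of s3event.py might look like:
def handler(event, context):
    for record in event['Records']:
        eventname = record['eventName']  # e.g. 'ObjectCreated:Put'
        if eventname.startswith('ObjectCreated'):
            pass  # handle the upload
        elif eventname.startswith('ObjectRemoved'):
            pass  # handle the deletion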

DynamoDB Streams

Unlike S3 events, DynamoDB streams emit information about the changed items. Each record contains an eventName of “INSERT”, “MODIFY” or “REMOVE”. We don’t get separate events we can discriminate on in the serverless.yml file.
This means that our handler must handle all event types. We can do something like:
eventname = record['eventName']
if eventname == 'REMOVE':
    self.delete()
# ... elif branches for the other event types ...
else:
    raise Exception('Unimplemented: id={} ignoring eventname={}'.format(self.id, eventname))
This is a pretty straight-forward pattern.
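For reference, a trimmed stream record as delivered to the lambda looks something like this; eventName is the field we switch on:
{'eventName': 'INSERT',
 'eventSource': 'aws:dynamodb',
 'dynamodb': {'Keys': {'id': {'S': 'ALEX18'}},
              'NewImage': {...},
              'StreamViewType': 'NEW_AND_OLD_IMAGES'}}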
Below we show how we could switch on these with a minimal top-level handler() function.

Handler structure

I’m finding it convenient to have a minimal handler function that loops over incoming S3 events or DynamoDB stream records. I frequently see a batch of DynamoDB records arrive in a single invocation of my lambda, so we can’t simply assume a single event and look only at Records[0].
So my handler tries to be as dumb as possible, looping over the triggers and calling a class to do the work:
import logging

log = logging.getLogger()

def handler(event, context):
    try:
        for record in event['Records']:
            AssetDDBRecordHandler(record)
    except Exception as e:
        msg = 'ERROR assetddb.handler: {}'.format(e)
        log.error(msg)
        return {'event': event, 'message': msg}
    return {'event': event,
            'message': 'Function executed successfully: assetddb.handler'}
This also makes it easy to wrap the record handler in an exception handler that logs the exception, letting the Lambda complete successfully instead of raising. An uncaught exception would cause Lambda to needlessly retry a permanently-failed event (the retry machinery presumably exists to cope with transient errors, overload conditions, and the like).
In the class, the constructor examines the event type (INSERT, MODIFY, REMOVE) and invokes a method specific to the event, acting as a simple dispatcher:
class AssetDDBRecordHandler:
    def __init__(self, record):
        self.id = record['dynamodb']['Keys']['id']['S']
        eventname = record['eventName']  # INSERT, MODIFY, REMOVE
        if eventname == 'REMOVE':
            self.delete()
        elif eventname == 'INSERT':
            self.insert()
        else:
            raise Exception('Unimplemented: id={} ignoring eventname={}'.format(self.id, eventname))

    def delete(self):
        try:
            # es is our Elasticsearch client, created elsewhere in the module
            es.delete(index='images', doc_type='image', id=self.id)
        except Exception as e:
            raise Exception('id={} deleting Elasticsearch index: {}'.format(self.id, e))
Note that the code that actually does the work is free to throw detailed exceptions up to the top-level handler() since it will catch them and log instead of blowing out the lambda.

DynamoDB Streams Native Protocol Deserialization

When we get S3 events in Lambda, we get a clean structure we can dissect easily to get the event, bucket, key and whatnot as native Python structures. When we access DynamoDB through the Boto3 library using the Table model, we can also read and write the Python structures easily.
But the stream records we get from DynamoDB in our Lambda are encoded with a low-level serialization protocol like the one you have to use when you work with Boto3’s DynamoDB Client model. Each datum is a dict indicating type and value, like:
{u'_dt': {u'S': u'2017-02-08T13:30:38.915580'},
 u'id': {u'S': u'ALEX18'},
 u'metadata': {u'M': {u'description': {u'S': u'12-year-old...'}}} }
So we have to deserialize it before we can process it. It’s tempting to think this would be easy to write, but remember that DynamoDB records can have nested elements, like the u'M' map (Python dict) above, so it quickly becomes a chore to roll your own.
“There’s got to be a better way!” Happily, there is.
Boto3 has to do this, and it has a deserializer (and serializer) we can simply import and use. This is how I do it:
from boto3.dynamodb.types import TypeDeserializer

deserialize = TypeDeserializer().deserialize
for record in event['Records']:
    data = {}
    new = record['dynamodb'].get('NewImage')
    if new:
        for key in new:
            data[key] = deserialize(new[key])
    id = data['id']
Now we can work with data as native Python objects.
We could get clever and use a Python dict-comprehension to combine three lines into one:
data = {k: deserialize(v) for k, v in new.items()}
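Note also that REMOVE events carry no NewImage, so for deletes we’d look at OldImage instead (present only when the table’s stream uses a StreamViewType that includes old images). A small sketch of a helper along those lines:
from boto3.dynamodb.types import TypeDeserializer

deserialize = TypeDeserializer().deserialize

def images(record):
    """Return (old, new) for a stream record as plain Python dicts; either may be empty."""
    ddb = record['dynamodb']
    old = {k: deserialize(v) for k, v in ddb.get('OldImage', {}).items()}
    new = {k: deserialize(v) for k, v in ddb.get('NewImage', {}).items()}
    return old, new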

Conclusion

We’re certain to come up with more design patterns that make the resource-event-function wiring easier, and the lambda processing more self-contained, but the above have emerged quickly and naturally as ways to structure our project. It’s been a fun and rewarding exercise which has given us the confidence to go all-out on future projects in a serverless, event-driven manner.