
Pick Our Brain
Open Source Contributions + Knowledge Sharing = Better World
-
Science
Science: Like magic, but real! Today, we're celebrating National Science Appreciation Day by geeking out about the everyday wonders around us. From the code that powers your apps to the physics that makes 3D animation possible, we live in a world where 'impossible' things happen every day. And the best part? We can explain how!

-
IAM Auth for Django Database: passwordless, not painless
TL;DR:
Adding IAM Auth requires increasing the RDS server to at least 4x the cost of a server with password auth, and likely much more for production. This makes it non-viable for our immediate use case with a relatively low-stress app.
Goal: no database passwords in code/configs
We’re running a Wagtail CMS site (built on Django) and we don’t want to use passwords to authenticate Wagtail/Django to our database or other AWS services since they present a risk if discovered and are hard to manage outside our committed code repo.
Typical access to a PostgreSQL database uses credentials including host, database, username, and password, and Django settings include these. We do not want to store creds in code, so we want to leverage AWS IAM Roles, a cloud-native mechanism, to permit connections.
We are using SES for Django email with AWS IAM authentication, obviating the need for passwords to authenticate to their SMTP server. We want to do the same for RDS access to our PostgreSQL database.
This gets pretty nerdy, but we hope it helps others use IAM for Django RDS auth and avoid a little-publicized problem in the current AWS implementation.
IAM Auth, EC2 Role, Django DB Wrappers
I first tried the approach from https://stackoverflow.com/a/57923227/4880852 with the 10-line wrapper in your.package.postgresql/base.py and it worked great — for 15 minutes, after which the temporary token it got expired. Oops.
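For reference, that approach boils down to a custom Django database backend whose get_connection_params() fetches a fresh IAM token to use as the password. Here is a minimal sketch of the idea (not the exact StackOverflow code; the hard-coded region and module path are assumptions):

# your/package/postgresql/base.py -- minimal sketch of an IAM-token DB wrapper
import boto3
from django.db.backends.postgresql import base


class DatabaseWrapper(base.DatabaseWrapper):
    def get_connection_params(self):
        params = super().get_connection_params()
        rds = boto3.client('rds', region_name='us-east-1')  # region assumed for this sketch
        # generate_db_auth_token returns a presigned token valid for about 15 minutes.
        params['password'] = rds.generate_db_auth_token(
            DBHostname=params['host'],
            Port=params.get('port', 5432),
            DBUsername=params['user'],
        )
        return params

With ENGINE pointed at your.package.postgresql, each new connection gets a fresh token; the 15-minute token lifetime is what bit us above.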
Below we describe how we used the package mentioned in that post, with IAM and an EC2 Role. In the examples below, our top-level database is default “postgres” with user “rdsiamuser”, and our Django database is “rdsiam”. We use CloudFormation for our Infrastructure-as-Code to create all our resources.
Enable IAM Auth in CloudFormation
Enable IAM Auth in the CloudFormation definition of the database:
SQLDatabase:
  Type: 'AWS::RDS::DBInstance'
  Properties:
    Engine: postgres
    DBName: !Ref DBName
    MultiAZ: !Ref MultiAZDatabase
    MasterUsername: !Ref DBUser
    MasterUserPassword: !Ref DBPassword
    EnableIAMDatabaseAuthentication: true
    DBInstanceClass: !Ref DBInstanceClass
    AllocatedStorage: !Ref DBAllocatedStorage
We have to add an IAM Role to our EC2 instance so it can use IAM Auth to talk to RDS; in CloudFormation, like this:
RolePolicies:
  Type: AWS::IAM::Policy
  Properties:
    PolicyName: root
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Action: rds-db:connect
          Resource: !Sub "arn:aws:rds-db:${AWS::Region}:${AWS::AccountId}:dbuser:*/${DBUser}"
The "dbuser" is a literal AWS term, and ${DBUser} is the same as our Django setting, the top-level RDS user. We have to use the wildcard "*" for the database because CloudFormation gives us no way to determine the RDS DbiResourceId! 🙁
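If you ever want to tighten that wildcard after the stack exists, the resource ID (DbiResourceId) can be looked up post-deployment. A rough sketch with boto3, where the instance identifier and region are placeholders:

import boto3

rds = boto3.client('rds', region_name='us-east-1')  # region assumed for this sketch
# 'mydbinstance' is a placeholder for your actual RDS instance identifier.
info = rds.describe_db_instances(DBInstanceIdentifier='mydbinstance')
resource_id = info['DBInstances'][0]['DbiResourceId']
# A tightened policy resource would then look like:
#   arn:aws:rds-db:<region>:<account-id>:dbuser:<resource_id>/<db-user-name>
print(resource_id)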
Set IAM auth for RDS user
After CloudFormation creates the EC2 and RDS, we can connect to the DB with the initial password we gave it, then grant the initial DBUser the rds_iam role so it can authenticate with IAM. Since my RDS was inside a private VPC, I found it easiest to launch a PostgreSQL docker container on the EC2 and run the command there:
ec2# docker run -it postgres:alpine bash
docker# psql -h MyRdsDbHost.us-east-1.rds.amazonaws.com -U rdsiamuser postgres
postgres=> GRANT rds_iam TO rdsiamuser;
postgres=> \du
   Role name   |           Attributes            |        Member of
---------------+---------------------------------+--------------------------
 rdsiamuser    | Create role, Create DB         +| {rds_superuser,rds_iam}
               | Password valid until infinity   |
Now you can see the user has the rds_iam role, so it can authenticate with IAM. Warning: this will prevent that user from logging in with normal password credentials! I didn't see this mentioned in the AWS docs.
Of course the goal is to not have passwords in code, and the CloudFormation above likely has the password committed to the repo. But after GRANTing rds_iam, the password no longer works, so this security bug turns into a security feature. I see no way to apply this GRANT at RDS creation time, nor any other mechanism to enable IAM auth entirely through CloudFormation.
Django config
I then used the code referenced in the post at https://github.com/labd/django-iam-dbauth , adding it to my requirements.txt. After reading the code and walking through it I realized the README docs were incomplete and we must supply a region for it to work. This is what I ended up with in my settings/dev.py file:
DATABASES = {
    'default': {
        'HOST': os.environ['DATABASE_HOST'],
        'NAME': os.environ['DATABASE_NAME'],
        'USER': os.environ['DATABASE_USER'],
        'ENGINE': 'django_iam_dbauth.aws.postgresql',
        'OPTIONS': {
            'use_iam_auth': True,
            'region_name': 'us-east-1',
        },
    }
}
Look, ma — no PASSWORD! The ENGINE and OPTIONS were the critical bits; the other info comes from the environment our Docker container runs in, which is pretty standard.
I could then run Django on my EC2 (we do it inside Docker) and it worked great for more than 15 minutes, so we knew the django-iam-dbauth worked without token timeouts. All was well… until it wasn’t.
Django Failures, rdsauthproxy Failures
After about an hour, I started seeing authentication failures in the Django logs. We’re running it in gunicorn and saw:
[2021-05-06 18:23:31 +0000] [52] [DEBUG] GET /
psycopg2.OperationalError: FATAL: PAM authentication failed for user "rdsiamuser"
FATAL: pg_hba.conf rejects connection for host "10.42.9.163", user "rdsiamuser", database "rdsiam", SSL off
and later, timeouts from gunicorn, presumably because Django could not auth to build a response:
[2021-05-06 19:44:35 +0000] [50] [CRITICAL] WORKER TIMEOUT (pid:63)
[2021-05-06 19:44:36 +0000] [50] [WARNING] Worker with pid 63 was terminated due to signal 9
This was a test instance with virtually no load except ALB health probes (GET /). RDS was on a db.t3.micro which had been fine for our developers exercising it when using typical password auth, so something broke when we switched to IAM. It seemed to recover after a while, briefly, then failed again and never did recover.
A look at the RDS Logs showed the cause of the problem:
* connect to 127.0.0.1 port 1108 failed: Connection refused
* Failed to connect to rdsauthproxy port 1108: Connection refused
* Closing connection 0
2021-05-06 20:35:10 UTC:10.42.9.163(35716):rdsiamuser@rdsiam:[31065]:LOG: pam_authenticate failed: Permission denied
2021-05-06 20:35:10 UTC:10.42.9.163(35716):rdsiamuser@rdsiam:[31065]:FATAL: PAM authentication failed for user "rdsiamuser"
2021-05-06 20:35:10 UTC:10.42.9.163(35716):rdsiamuser@rdsiam:[31065]:DETAIL: Connection matched pg_hba.conf line 13: "hostssl all +rds_iam all pam"
2021-05-06 20:35:10 UTC:10.42.9.163(35718):rdsiamuser@rdsiam:[31067]:FATAL: pg_hba.conf rejects connection for host "10.42.9.163", user "rdsiamuser", database "rdsiam", SSL off
It appears that RDS has a proxy for PostgreSQL called "rdsauthproxy", but it died for some unexplained reason. It may have come back once or twice but eventually went down permanently, and IAM auth never worked again. I was able to stop then restart the RDS in the AWS console and the rdsauthproxy would come back, but it would soon go down again without a trace.
I found only one hit on this topic, from August 2020, with a "me too" from March 2021; I posted my own "me too" reply and got zero response from AWS: https://forums.aws.amazon.com/thread.jspa?threadID=326681
AWS says too small, known problem
I filed a support ticket with AWS. They said that my t3.micro instance has a "CPU Baseline" of 10% and had been consistently above it, that the T3 burstable "CPU Credit" balance dropped to 0 so it had used all its credits, that Freeable Memory dropped very low, and that swap was high.
It sounded like I was thrashing the underlying T3 instance and exhausting resources, and that was probably what was killing rdsauthproxy. But why was it fine, under much higher developer load, when using Password auth instead of IAM auth?
AWS Support then went on to say:
while our internal team is indeed investigating further on this, since this issue is not really a bug and is mostly related to resource throttling, there might not really be a “fix” for it, as such. Therefore, we recommend all our customers to ensure that they have enough resources to have a seamless experience and avoid such scenarios.
We have no guidance on how much we have to scale up our RDS instance before it stops dying: double? quadruple? a different instance type? Who knows… 🙁
db.t3.small seems to work — then falls over
I tweaked my CloudFormation to replace the micro instance with a db.t3.small and redeployed. Happily, it copied all the data and after a while the app started working again. Still, we're not putting any load on this test instance. For our QA and Prod servers, we'll have to watch load carefully; maybe use a db.m5.* instance instead of a burstable db.t3.* instance.
About 12 hours later, with zero app load, we saw auth failures and RDS load went from < 10% to about 30%. So this size is too small for anything with IAM auth.
Adding IAM auth doubles, quadruples, octuples cost
Our db.t3.micro had been running for months for our devs and for demoing to our customer; it never fell over and was snappy enough. It costs $0.018/hour, or about $13/month — totally reasonable for a Dev and maybe QA instance.
The db.t3.small fell over in 12 hours, and costs twice as much. A db.t3.medium costs twice that. So our dev instance is now costing 4x what it used to in order to support IAM auth, and we don’t know if it will fall over under anything but minimal load.
For QA, we’d need at least the same size, 4x the cost of the micro.
For Prod, we'd need 2x instances and probably an "m" instance so we don't run out of "t" burst credits. The minimum in that class is db.m6g.large at $0.159/hour, almost 10x the cost of our micro, and we need 2 for failover: about $230/month. That's a lot of money for our fairly-small commercial app's database, especially since using the same size in Dev and QA (with only one instance each) adds another $230/month. That's $500/month for a small app with Dev, QA, and Prod. Not a way to win customers.
The point is that the overhead of running IAM auth is causing us to increase our DB cost by roughly 4x to 10x compared to password-based auth!
Could we switch to Aurora? Let's presume it can do IAM auth without falling over. If we go to the calculator and pick the lowest-priced option (db.r4.2xlarge), the price is $847! OK, that option's out.
Maybe if our DB needs were big and we already required a db.t3.xlarge or db.m6g.large, the IAM penalty wouldn't be noticeable, but it's noncompetitive for our use case.
Locked Yourself Out?
So you've GRANTed rds_iam to your PostgreSQL user and locked yourself out of your database. Now what?
You can use the AWS console to disable IAM authentication, wait for it to reconfigure, and try again later. Or you can use the technique from https://aws.amazon.com/premiumsupport/knowledge-center/rds-postgresql-connect-using-iam/ to get a 15-minute password token, just like our Django plugin does via Boto3 calls.
On the EC2 with access to the RDS, use the "aws" CLI to generate a token. If this isn't installed on your EC2, you can use their Docker image. Generate an auth token:

export PGPASSWORD="$(aws rds generate-db-auth-token --hostname $RDSHOST --port 5432 --region us-east-1 --username rdsiamuser)"
Then, connect using a Dockerized Postgres client on the EC2:

docker run -it \
  -e PGPASSWORD=$PGPASSWORD \
  postgres:alpine \
  psql -h $RDSHOST -p 5432 \
  "sslmode=require dbname=postgres user=rdsiamuser"
This takes about 8 seconds on the first connection but it’s quick on subsequent connections; maybe the rdsauthproxy has a cache. Then you should be able to create another user with normal password creds or perhaps revoke the IAM creds to restore password login:
postgres=> REVOKE rds_iam FROM rdsiamuser;
Then you should be able to log in with a password and do whatever else you need.
-
V! Studios Wins 2020 Communicator Award of Distinction for Online Video
V! Studios has received a 2020 Communicator Award of Distinction for its online video series, NASA ScienceCasts. The NASA ScienceCast series highlights scientific research and discoveries, keeping audiences informed, advancing understanding, and bringing wonder through animation and visualizations.
The Communicator Awards are judged and overseen by the Academy of Interactive and Visual Arts (AIVA), an assembly of leading professionals from various disciplines of the visual arts dedicated to embracing progress and the evolving nature of traditional and interactive media.
NASA ScienceCasts have invited viewers to learn more about topics ranging from studying forest height using laser light from space, to black holes, to particle physics on the International Space Station. Episodes are produced in 4K for audiences to enjoy across many online platforms, including Facebook, YouTube, Twitter, and iTunes, as well as being broadcast on NASA TV.
With over 6,000 entries received from across the US and around the world, the Communicator Awards is the largest and most competitive awards program honoring creative excellence for communications professionals. “We are extremely proud to recognize the work received in the 26th Annual Communicator Awards. This class of entries embodies the best of the ever-evolving marketing and communications industry” noted Eva McCloskey, managing director of the AIVA.
The Communicator Awards is the leading international awards program recognizing big ideas in marketing and communications, and one of the largest awards of its kind in the world. Founded nearly three decades ago, The Communicator Awards honors work that transcends innovation and craft – work that makes a lasting impact – providing an equal chance of winning to all entrants regardless of company or agency size and project budget.
Headquartered in Tysons Corner, VA, V! Studios is a unique hybrid company, successfully combining left brain and right brain skills to weave technology, information, and the arts into innovative and effective products and services. Learn more about V! Studios services at: V-Studios.com
-
Serverless Step Functions with Callback
This is a demo of how you can use the “callback” pattern to restart a Step Functions state machine from within a Lambda function. It took me a while to dig through the AWS docs, sample code, and examples to unlock the mysteries, so I hope it saves you some time.
It is inspired by Ross Rhodes' tweet on callbacks with Step Functions. He used the AWS Cloud Development Kit and SQS, but I'll be using the Serverless Framework with direct Lambda calls because it's a pattern that comes up repeatedly in our use cases. Ben Kehoe wrote an excellent AWS Blog post on the same topic; he's using SNS Email for human approvals.
The SNS approach is also not exactly aligned with our current use cases, but SQS- and SNS-driven restarts are both likely something we'll need at some point.
All the code here is on our GitHub: https://github.com/v-studios/serverless-stepfunctions-callback
Our Real Use Case
Our application takes a file and uses a Lambda to split it into chunks which are dropped onto S3. Each chunk's S3 CreateObject event triggers a Lambda to process the chunk, so all the chunks get processed in parallel. Some chunks take longer than others, so once we determine that all the chunks are done, we want to restart our state machine. We do this by calling the Step Functions API directly, indicating success.
Demo Implementation
This demo code skips the complexity of our real app, allowing us to focus on the state machine stop and restart. We’ll use a random chance to decide when we’re done, with a chance that the processing function fails, so we can signal the failure. Our state machine has a handler for this, so it can do different things on success and failure.
Our preferred backend language is Python, so that’s what we’ll use for our Lambda handler. Translating to Node or some other Lambda language should be trivial: just map the two API calls we make to your Step Functions SDK.
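For orientation, here is roughly what that handler looks like end to end; this is a sketch, not the repo's exact code, and the 80% success chance is an assumption. We'll walk through the two key API calls below.

import json
import random

import boto3

SFN = boto3.client('stepfunctions')


def process_and_check_completion(event, context):
    # The state machine passes the callback token in the Payload (see serverless.yml below).
    task_token = event['taskToken']
    chance = random.random()
    if chance < 0.8:  # pretend 80% of runs find all chunks processed successfully
        SFN.send_task_success(
            taskToken=task_token,
            output=json.dumps({'msg': 'this goes to the next state',
                               'status': 'looking good'}))
    else:
        SFN.send_task_failure(
            taskToken=task_token,
            error='ProcessingFailed',
            cause=f'Something broke in our chunk processing chance={chance}')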
We’ve been using the Serverless Framework for a while for our commercial and government projects and really like it: it’s a pleasure to use and makes all the boring stuff go away. It takes care of the infrastructure so we don’t need to do our own CloudFormation, nor its shiny new cousin, Cloud Development Kit. Under the covers, Serverless does CloudFormation for us, and that’s just where it should be — under the covers, so we can inspect it if we need to, and ignore it most of the time.
Takahiro Horike’s Step Function plugin for the Serverless Framework makes it a breeze to describe state machines directly in our serverless.yml file.
Get it Running
Install the dependencies:
npm install
Assuming you’ve set your AWS credentials in your environment (we set AWS_PROFILE), deploy with Serverless; we use the default us-east-1 region and stage dev:
sls deploy
When done, you should see your functions and an HTTP endpoint we created to start the state machine:
Serverless: Packaging service…
…
Serverless: Stack update finished…
Service Information
service: serverless-stepfunctions-callback
stage: dev
region: us-east-1
stack: serverless-stepfunctions-callback-dev
resources: 15
api keys:
None
endpoints:
functions:
SplitDoc: serverless-stepfunctions-callback-dev-SplitDoc
ProcessAndCheckCompletion: serverless-stepfunctions-callback-dev-ProcessAndCheckCompletion
layers:
None
Serverless StepFunctions OutPuts
endpoints:
GET – https://yoururlhere.execute-api.us-east-1.amazonaws.com/dev/start

In the AWS console, you should see your state machine under Step Functions – State machines.
You can get details by clicking on the name; click the Definition tab to get the diagram.
Under the “Executions” tab, you can “Start execution”, and leave the default input alone. Depending on chance, it should go through ContinueProcess and succeed, or ProcessingFailed and fail. We can examine the inputs and outputs of each state, so here we look at ContinueProcess:
For the failure case, we examine ProcessingFailed and can see it has an Exception instead of Output:
For convenience, we added an HTTP endpoint to start the state machine; this simulates how our real application’s state machine is started by some external event, like dropping an object into S3 or a DynamoDB row change. You can use this to start the state machine from the CLI instead of the console:
curl https://yoururlhere.execute-api.us-east-1.amazonaws.com/dev/start
Do this a few times then look at the console to see the results; most will likely succeed, some will fail, due to the random chance.
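For reference, the Lambda behind that /start endpoint only needs to call the Step Functions StartExecution API. Here is a minimal sketch (not necessarily the repo's exact code; STATE_MACHINE_ARN is an assumed environment variable):

import json
import os

import boto3

SFN = boto3.client('stepfunctions')


def start(event, context):
    # Kick off a new execution of the state machine; the input can be any JSON.
    res = SFN.start_execution(
        stateMachineArn=os.environ['STATE_MACHINE_ARN'],
        input=json.dumps({'source': 'http-start'}))
    return {'statusCode': 200,
            'body': json.dumps({'executionArn': res['executionArn']})}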
On to the Code!
So how does this work? How are we defining the state machine, and how do we define the restart step, then how do we invoke it? We’ll ignore the overall state machine definition because it’s well-documented, so we can focus on the more subtle callback mechanism.
In serverless.yml, for the Resource we specify the waitForTaskToken magick incantation. Normally, our state machine would specify a Lambda function as its resource, but we can't do that when we want to wait. We then specify our Lambda under Parameters as FunctionName, and pass into it a Payload containing the Step Function $$.Task.Token:
WaitForCompletion:
  Type: Task
  Resource: arn:aws:states:::lambda:invoke.waitForTaskToken
  Parameters:
    FunctionName: ${self:service}-${opt:stage}-ProcessAndCheckCompletion
    Payload:
      taskToken.$: $$.Task.Token
  Next: ContinueProcess  # the happy path

The Lambda will need to call the Step Functions API with this Task.Token to flag success or failure, so it has to be an input to the function. We can add anything else we want as an input here too.
As usual, the state has a Next for the happy path, but here we’ve defined error handlers with the Catch directive. We first try to catch an error that we specify in our Lambda, then a catch-all in case anything else blows up (e.g., a Python exception due to bad code):
Catch:
  - ErrorEquals: ["ProcessingFailed"]
    Next: ProcessingFailed
  - ErrorEquals: ["States.TaskFailed"]
    Next: UnexpectedFailure

In our Lambda handler function, we don't actually do any processing in this demo. For the real application, we'd process our chunk and check for all the chunks being processed; if they're not all complete, we'd just return. Here, we pretend we have determined that all the chunks are done, and signal the Step Function state machine to continue:
task_token = event['taskToken']
SFN.send_task_success(
    taskToken=task_token,
    output=json.dumps({'msg': 'this goes to the next state',
                       'status': 'looking good'}))

We can set the output to be anything we want to feed to the next step in our state machine.
To indicate failure, we make a similar call, and can set optional error to a named code we can catch in our Step Function, and the cause to provide more details:
SFN.send_task_failure(
    taskToken=task_token,
    error='ProcessingFailed',
    cause=f'Something broke in our chunk processing chance={chance}')
If this gets executed, the ProcessingFailed should get caught by the Catch… ErrorEquals: ["ProcessingFailed"] clause in the state machine definition.

Conclusion
We now know how to define waitForTaskToken and pass tokens to Lambdas so they can signal success or failure to restart the state machine, and we can use it with the Serverless Framework's Step Functions plugin with ease. Step Functions invoke Lambdas as Tasks asynchronously, so we may have many opportunities to have the state machine pause and wait for completion of a longer-running Lambda, or many parallel Lambdas.
-
V! Studios Receives Nomination for a 2018 Emmy® Award
V! Studios has received a nomination for a 2018 Emmy® Award for its work on the NASA ScienceCasts episode, “Two Sides of the Same Star.” The NASA ScienceCast series highlights scientific research and discoveries, keeping audiences informed, advancing understanding, and bringing wonder through animation and visualizations.
“Two Sides of the Same Star” explores the nature of neutron stars. Animations are used throughout the episode to explain the variability of neutron stars’ magnetic fields, and the scientific debate over the evolutionary stages of a neutron star.
The nomination comes from the National Capital Chesapeake Bay Chapter (NCCB) of the National Academy of Television Arts & Sciences (NATAS) in the category of Health/Science – Program Feature/Segment. The NCCB is a non-profit, professional organization serving the Maryland, Virginia and Washington, DC television community. The NATAS Emmy® Award is the industry’s benchmark for the recognition of television excellence.
The 61st Emmy® Awards will be Livestreamed on June 22, 2019 at www.capitalemmys.tv/emmys.
Headquartered in Tysons Corner, VA, V! Studios is a unique hybrid company, successfully combining left brain and right brain skills to weave technology, information, and the arts into innovative and effective products and services. Learn more about V! Studios services at: V-Studios.com.
-
Quick process of adapting Megascan Atlas images into volumetric lighting scenes.
If you are like me, and want to use volumetric atmosphere in a scene that incorporates scanned imagery from Megascans (using their 2D scanned imagery known as Atlases) to populate the area, you are out of luck using the standard method of applying the images to planes and using the opacity channel to cut out the shape. This is because the outline of the plane is still visible in a scene that incorporates volumetric fog/lighting, etc.
To get around this limitation, I have found a moderately quick method to get the look you want with the proper shadows and such. This process involves three software packages: Adobe Photoshop, Adobe Illustrator, and Cinema 4D, plus the Megascan Atlas source file(s) from quixel.com.

For this blog, I'm going to talk about incorporating some seaweed images from Megascans into my underwater scene. Here is what the scene looks like with the alpha-channeled image plane cut-out approach. You'll notice the planes are easily visible along the bottom of the seabed, even though we 'cut out' the shape of the seaweed with the alpha channel/opacity channel of the seaweed.
With the process I'm going to describe below, here is what the scene will now look like:

When writing this blog, I wanted to encompass all levels of experience, so I apologize in advance if you already know many of these steps; hopefully you'll still find some useful information in this approach. There are many ways to achieve this look, and this is just one of them, but it was one I found quick and easy. There are plug-ins that can expedite this process, so feel free to experiment further. I've broken the process down into 26 steps. Here they are:
1) Navigate to the Megascan library residing on quixel.com to find some seaweed images to use on the seafloor (https://quixel.com/megascans/library?search=seaweed).
2) After selecting the 'Plant Seaweed' Atlas that I want to use, I download it to my desktop. You will notice that when you open it up, it contains a variety of files (Albedo, Bump, Specular, Normal, etc.) and that they are in 4K resolution. Depending on how close you are going to get to the image, you may only need a couple of the files. For my purpose, I'm going to use it to populate the seafloor in my scene and don't anticipate getting very close to it. So, I'll only use the 'Albedo' (for the color) and 'Opacity' (for the cutout) images, but feel free to use whichever maps make the most sense for your project. I'm also going to reduce the resolution quite a bit since I won't need that much detail and I want to conserve memory in Octane, since I'm running it on either a GTX 1060 or GTX 1070 most of the time.
3) Next, open up all the image files you'll want to use for your project in Adobe Photoshop (or whichever app you use). In my case, I'll open the Albedo and Opacity files.
4) Copy and paste each of the image files into layers on the 'Albedo' file. You want to have all the channels on one image as layers so that when we crop them, they will all line up perfectly.
5) Once you have that done, reduce the image to something more appropriate to your needs. In my case, I'm only going to need a resolution of 512×512 pixels. Go under Image>Image Size and select the resolution you want to shrink it to.
6) Then select the 'Opacity' layer and choose Image>Adjustments>Invert to invert the opacity layer to black on white, as Adobe Illustrator sees paths as black on white.
7) It should now look like this:
8) Do a 'Save As' of this Photoshop file as a native PSD format file. We are going to create two separate objects from this file: the seaweed on the left and the seaweed on the right. So, use the 'Crop' tool to crop the image on the left first. Crop it as close as you can to the borders of the seaweed image.
9) It should look like this now:
10) Next we will Save out the Opacity and Albedo layers as two separate images. Go to File>Export>Quick Export as PNG and save it as ‘left seaweed opacity’ (or something similar if you aren’t using Adobe Creative Cloud).
11) Next hide the ‘opacity layer’ layer and do another File>Export>Quick Export as PNG and save it as ‘left seaweed albedo’. Since it is the same dimensions as the opacity layer, when we import it into C4D as a texture it will fit perfectly.
12) Now go back several steps to right before we cropped the left seaweed image, so that you will have both images visible again. This time crop the right seaweed and repeat steps 9-11 (naming your files ‘right’ instead of ‘left’ – as I’m sure you already know) 🙂
13) Now we will jump into Adobe Illustrator. Once you have Illustrator open, open up the ‘left seaweed opacity’ file you had saved. Then select the left seaweed image.
14) Then open the 'Image Trace' window by going to Window>Image Trace; this will pop up the floating 'Image Trace' window.
15) In the ‘Image Trace’ window, select the ‘Preset’ drop down menu and choose ‘Silhouettes’ which I found worked well for this image.
16) It’s going to look a little blobby. Don’t worry, we’ll clean it up. Toggle down the ‘Advanced’ arrow to see additional controls. I found that the settings of 180 for the Threshold, 100% for the Paths, 100% for the Corners and 50 px for the Noise created a good clean image. You want to make sure you don’t adjust the Threshold and Noise setting too low or you’ll have free-hanging portions of your image which will import with issues into C4D.
17) Now we want to Save the file out as an Adobe Illustrator format file: File>Save As. Create a folder, or use one you already designated, and name it something logical like 'left seaweed-opacity.ai'.
18) You will get prompted by a dialog box asking you what version to save it as. You MUST select Illustrator version 8, as C4D only reads that format.
19) Okay, almost there. Now launch Cinema4D and open the ‘left seaweed-opacity.ai’ file. You’ll be prompted by a dialog box asking you what ‘Scale’ you want to bring it in as. I found that for my purposes, a Scale of 0.05 Centimeters and Connect Splines and Group Splines checked, worked well.
20) The imported Illustrator file should look like this in your C4D window. It should come in as a spline object.
21) Next, you’ll want to add an ‘Extrude’ attribute to the imported seaweed.
22) Put the ‘left seaweed-opacity’ spline object under the ‘Extrude’ attribute. Then in the ‘Extrude’ options, select the ‘Object’ tab and make the Z Movement something like .25
23) Now we need to texture it. Create an Octane Diffuse Material and Import the ‘Seaweed Left Albedo’ image we had created earlier and apply it to the Diffuse channel.
24) We need to apply the new material onto the Extruded object. However, change the ‘Projection’ method from ‘UVW Mapping’ to ‘Flat’.
25) Finally, Right-Mouse click the texture icon and select ‘Fit to Object’ and that’s it! Repeat the process for the ‘Right Seaweed’ image and any others that you want to use. You should probably name the file something like ‘Left Seaweed’ so you know what it is as you start importing additional seaweed objects.
26) Now copy and paste this object (or objects) into your volumetric scene project to replace the decal planes, and you have a clean model of the seaweed. This is actually very quick once you understand the process. Hope this helps!
Here is a finished render using the process I outlined in this blog, for both the seaweed particulates floating in the water and the seaweed growing on the seafloor:
-
Unlocking table data using open source OCR
This summer we were awarded a small research grant from NASA’s Technology Data and Innovation Division to investigate extracting structured information from scans of engineering documents, and we recently demoed our proof-of-concept app for the project to NASA’s Office of the CIO. Our previous work for the NASA Extra Vehicular Activities (EVA) Office used serverless cloud computing and optical character recognition (OCR) to extract unstructured text, and make documents searchable. For this project, NASA asked us to retrieve structured tabular data from the parts lists in their technical diagrams. Because manual entry of these details is tedious, slow, and error-prone, NASA is looking for software tools to assist human technicians by making this process easier, faster, and more accurate.
After surveying the literature, we came up with several candidate approaches. Though we initially expected to use OCR software to solve the entire problem, we found it was unable to reliably extract all the content from tables it identified in the diagrams. In the end, we came up with a three-step approach combining best-of-breed open-source tools: (1) use techniques from computer vision to identify horizontal and vertical lines; (2) cluster the parallel lines to infer table rows and columns (and, by extension, cells); (3) extract text from the cells using OCR.
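To make the three steps concrete, here is a minimal sketch of the idea (not our production code), using OpenCV to find the table lines and Tesseract, via pytesseract, to read each cell. The kernel lengths and thresholds are assumptions, and it presumes the table region has already been cropped out of the page:

import cv2
import numpy as np
import pytesseract


def cluster(coords, gap=5):
    """Collapse runs of nearby pixel coordinates into single line positions."""
    groups = []
    for c in coords:
        if groups and c - groups[-1][-1] <= gap:
            groups[-1].append(c)
        else:
            groups.append([c])
    return [int(np.mean(g)) for g in groups]


def extract_table(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

    # (1) Morphological opening with long, thin kernels keeps only long lines.
    horiz = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                             cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vert = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))

    # (2) Cluster the parallel lines into row and column boundaries.
    rows = cluster(np.where((horiz > 0).sum(axis=1) > horiz.shape[1] // 2)[0])
    cols = cluster(np.where((vert > 0).sum(axis=0) > vert.shape[0] // 2)[0])

    # (3) OCR the region between each adjacent pair of boundaries (a cell).
    table = []
    for y0, y1 in zip(rows, rows[1:]):
        table.append([pytesseract.image_to_string(gray[y0:y1, x0:x1],
                                                  config='--psm 6').strip()
                      for x0, x1 in zip(cols, cols[1:])])
    return table

In practice we tune the kernel lengths and clustering gap to the scan resolution and clean up the OCR output before writing the CSV.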
With the server-side algorithm identified, we developed a simple, focused UI to help users feed in the images of parts list tables. First, the user selects and uploads a document (Figure 1), which our software converts to an image for display. The user then “lassos” the desired table inside of this image (Figure 2). Finally, the server does the extraction and returns a downloadable CSV which the user can view/edit in Excel, Google Sheets, etc (Figure 3).
Since we can apply our technique to extract text and row, column, and cell relationships from any tabular data source and we can’t post NASA’s sensitive spaceflight hardware diagrams here, we’ll be substituting an engineering diagram we found on the Internet.

Figure 1: user-uploaded diagram 
Figure 2: user lassos table 
Figure 3: the extracted table text in a spreadsheet

As you can see, the accuracy of text extraction and row, column, and cell preservation is outstanding, even when starting with a low-resolution, low-contrast scan of a technical drawing.
We're very happy with the results of this quick proof-of-concept and look forward to applying it to new data sets and use cases to refine it further. We have some ideas for improving the feature set, and are really interested in comparing and/or combining it with AWS Textract to prepare data sets for domain-specific tabular data extraction AIs! If you're interested in scheduling a demo or have suggestions on future directions for this work, please contact us at info@v-studios.com or leave a comment below!
-
Lambda-generated Presigned S3 URLs with AES encryption: CORS is Hell
This is a follow-up to a previous post on how we use Lambdas to generate presigned URLs so that a user's browser can upload directly to S3. We now want our S3 bucket to enforce server-side encryption for all uploaded files. Getting all the pieces to work together was a bit hairy: bucket policies, URL settings, HTTP headers, and mostly the dreaded CORS configuration. This approach should be applicable to other upload properties as well. Finally, we close with a comparison of the default AWS signature algorithm and the newer V4 signatures.
Architecture
The upload portion of our architecture looks like the following diagram. An Angular application is served from an S3 bucket to the browser. It has a component to select a file and invoke a getUploadURL function which sends the filename and MIME type to a Lambda function; the Lambda calculates a presigned URL which permits uploading for a short time, using the IAM permissions applied to the Lambda. This allows the browser to do secure uploads without leaking credentials; more details on this are in that earlier post.
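For context, the getUploadURL Lambda is just a thin wrapper around generate_presigned_url behind API Gateway. Here is a minimal sketch, not our production handler, where the bucket env var, key prefix, and request body shape are assumptions:

import json
import os

import boto3

s3 = boto3.client('s3')
UPLOAD_BUCKET_NAME = os.environ['UPLOAD_BUCKET_NAME']  # assumed environment variable


def get_upload_url(event, context):
    # Expects a JSON body like {"filename": "mydoc.pdf", "content_type": "application/pdf"}
    body = json.loads(event['body'])
    url = s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': UPLOAD_BUCKET_NAME,
                'Key': 'doc_pdf/' + body['filename'],
                'ContentType': body['content_type'],
                'ServerSideEncryption': 'AES256'},
        ExpiresIn=300)
    return {'statusCode': 200,
            'headers': {'Access-Control-Allow-Origin': '*'},
            'body': json.dumps({'url': url})}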
Our system resources, policies, and Lambda code are defined using the Serverless Framework; it tames the complexity and makes deployment a breeze.

S3 Bucket Policy Enforces Crypto
We define a policy on our S3 bucket that requires uploads to use server-side encryption (SSE) with the AES-256 cypher. It does this by checking the appropriate headers supplied with the upload. Rather than repeat it here, check the AWS docs.

Lambda returns Presigned URLs with SSE
When we generate the presigned URL, we include a requirement for SSE using AES. We're using Python and the Boto3 SDK.

s3 = boto3.client('s3')  # See below about non-default Signature version 4
params = {'Bucket': UPLOAD_BUCKET_NAME,
          'Key': 'doc_pdf/' + filename,
          'ContentType': content_type,
          'ServerSideEncryption': 'AES256'}
url = s3.generate_presigned_url('put_object',
                                Params=params,
                                ExpiresIn=PSU_LIFETIME_SECONDS)

The URL we get includes query string parameters indicating we want x-amz-server-side-encryption, and the shape of the URL depends on the AWS signature version we're using (see below). This seems fine, but it doesn't actually force the encryption. The generated URL can only specify information on the URL's query strings, but S3 doesn't look at those; it looks for HTTP headers to tell it how to disposition the upload.

Browser Must Set SSE HTTP Headers
Since S3 wants HTTP headers to tell it to enable encryption (as well as Content-Type and other metadata), we must have our client code set them. In our Angular app, we do this:

putUploadFile(uploadURL: UploadURL, file: File, fileBody): Observable<any> {
  const headers = {
    'Content-Type': file.type,
    'x-amz-server-side-encryption': 'AES256',  // force SSE AES256 on PUT
  };
  const options = { 'headers': headers };
  return this.http.put(uploadURL.url, fileBody, options).pipe(
    tap(res => console.log(`putUploadFile got res=${JSON.stringify(res)}`)),
    catchError(this.handleError<UploadURL>('putUploadFile', null)));
}

Watching the browser console, we can grab the generated URL and use "curl" to PUT to the S3 bucket with the same presigned URL and HTTP headers; our upload works:

curl -v -X PUT \
  -H "Content-Type: application/pdf" \
  -H "x-amz-server-side-encryption: AES256" \
  --upload-file mydoc.pdf \
  $PresignedUrlWeGotFromLambda

However, when the Angular app does the HTTP PUT, it fails.

NG PUT requires S3 CORS allowing SSE Header
The console shows errors in the HTTP OPTIONS preflight check; this sure smells like a CORS problem. When we had our serverless.yml create our bucket, we defined a CORS configuration that allowed us to PUT and to specify Content-Type headers. We just need to add a new CORS setting to tolerate the SSE header.

Type: AWS::S3::Bucket
Properties:
  BucketName: ${self:custom.s3_name}
  CorsConfiguration:
    # Needed so WebUI can do OPTIONS preflight check
    CorsRules:
      - AllowedMethods:
          - PUT
        AllowedOrigins:
          - "*"
        AllowedHeaders:
          - content-type
          - x-amz-server-side-encryption

We could have configured it with AllowedHeaders: "*", but that's more permissive than we'd like, so we opt to be explicit in what we tolerate. We redeploy our Serverless stack to update the S3 configuration, and our app starts uploading successfully! If you're not doing this with Serverless, just update through the AWS Console or whatnot. Now we can see the files we uploaded are AES-256-encrypted.

AWS Signature: default versus V4
By default, the boto3 S3 client is not using AWS Signature Version 4, and the upload does work. We've used V4 before on other projects and understood it to be best practice; we thought it might be required, but it turns out it's not. However, we can enable V4, and it works great. Interestingly, the generated presigned URLs are very different.

In both cases, the base URL we get is the same:

https://myuploads-dev.s3.amazonaws.com/doc_pdf/mydoc.pdf

There are significant differences in the query string parameters appended to this. Below we show the decoded parameters for comparison.

Default Signature
We get an S3 client with the default signature algorithm:

s3 = boto3.client('s3')

The query string parameters are:

AWSAccessKeyId: ASPI31415926535
Signature: Vqfl0NqIrr6ifBB3f9T1hXI5/+U=
content-type: application/pdf
x-amz-server-side-encryption: AES256
x-amz-security-token: …
Expires: 1541015925

AWS V4 Signature
We can request the V4 signature like:

s3 = boto3.client('s3', config=Config(signature_version='s3v4'))

The query string parameters become:

X-Amz-Algorithm: AWS4-HMAC-SHA256
X-Amz-Credential: ASPI31415926535/20181031/us-east-1/s3/aws4_request
X-Amz-Date: 20181031T190519Z
X-Amz-Expires: 3600
X-Amz-SignedHeaders: content-type;host;x-amz-server-side-encryption
X-Amz-Security-Token: …
X-Amz-Signature: a22b58dce238ed393026027ec0b40a7ffd0a9647d792fb0cc3d720bc1cc89fe4

Wrap-up
There is a lot of cat-herding to make this work, but once in place, it works beautifully: enforced encryption, time-limited presigned URLs, and browser uploads to S3. Now that we know all the pieces that need to be addressed, we can use the same approach to add other S3 object properties, like read-only ACLs, expiration dates, etc.