TL;DR:
Adding IAM Auth requires upgrading the RDS instance to at least 4x the cost of a server using password auth, and likely much more for production. This makes it non-viable for our immediate use case with a relatively low-stress app.
Goal: no database passwords in code/configs
We’re running a Wagtail CMS site (built on Django) and we don’t want to use passwords to authenticate Wagtail/Django to our database or other AWS services since they present a risk if discovered and are hard to manage outside our committed code repo.
Typical access to a PostgreSQL database uses credentials including host, database, username, and password, and Django settings include these. We do not want to store creds in code, so we want to leverage AWS IAM Roles, a cloud-native mechanism, to permit connections.
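For reference, a typical password-based Django configuration looks something like this (host and names are illustrative):

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "mydb.example.us-east-1.rds.amazonaws.com",  # illustrative endpoint
        "NAME": "mydb",
        "USER": "myuser",
        "PASSWORD": "s3cret-in-the-repo",  # the credential we want to eliminate
    }
}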
We are using SES for Django email with AWS IAM authentication, obviating the need for passwords to authenticate to their SMTP server. We want to do the same for RDS access to our PostgreSQL database.
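For us that is just a backend setting; assuming the django-ses package (our exact setup may differ, but it is representative), the mail backend calls the SES API using the instance role via boto3, so there are no SMTP credentials at all:

# Email through the SES API using the EC2 instance role; no SMTP username/password.
# Assumes the django-ses package and its AWS_SES_REGION_NAME setting.
EMAIL_BACKEND = "django_ses.SESBackend"
AWS_SES_REGION_NAME = "us-east-1"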
This gets pretty nerdy, but we hope it helps others use IAM for Django RDS auth and avoid a little-publicized problem in the current AWS implementation.
IAM Auth, EC2 Role, Django DB Wrappers
I first tried the approach from https://stackoverflow.com/a/57923227/4880852 with the 10-line wrapper in your.package.postgresql/base.py and it worked great — for 15 minutes, after which the temporary token it obtained expired. Oops.
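For reference, that wrapper looks roughly like the following (a sketch along the lines of that answer, assuming boto3 is available; your.package.postgresql is whatever dotted path you point ENGINE at):

# your/package/postgresql/base.py -- swap the password for a short-lived IAM
# auth token at connection time. Note the token is only valid for 15 minutes.
import boto3
from django.db.backends.postgresql import base


class DatabaseWrapper(base.DatabaseWrapper):
    def get_connection_params(self):
        params = super().get_connection_params()
        rds = boto3.client("rds", region_name="us-east-1")
        params["password"] = rds.generate_db_auth_token(
            DBHostname=params["host"],
            Port=params.get("port", 5432),  # assume the default port if unset
            DBUsername=params["user"],
        )
        return params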
Below we describe how we used the package mentioned in that post, with IAM and an EC2 Role. In the examples below, our top-level database is default “postgres” with user “rdsiamuser”, and our Django database is “rdsiam”. We use CloudFormation for our Infrastructure-as-Code to create all our resources.
Enable IAM Auth in CloudFormation
Enable IAM Auth in the CloudFormation definition of the database:
SQLDatabase:
  Type: 'AWS::RDS::DBInstance'
  Properties:
    Engine: postgres
    DBName: !Ref DBName
    MultiAZ: !Ref MultiAZDatabase
    MasterUsername: !Ref DBUser
    MasterUserPassword: !Ref DBPassword
    EnableIAMDatabaseAuthentication: true
    DBInstanceClass: !Ref DBInstanceClass
    AllocatedStorage: !Ref DBAllocatedStorage
We have to add an IAM Role to our EC2 instance so it can use IAM Auth to talk to RDS; in CloudFormation, like this:
RolePolicies:
  Type: AWS::IAM::Policy
  Properties:
    PolicyName: root
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Action: rds-db:connect
          Resource: !Sub "arn:aws:rds-db:${AWS::Region}:${AWS::AccountId}:dbuser:*/${DBUser}"
The “dbuser” is a literal AWS term; ${DBUser} is the same as our Django setting, i.e. our top-level RDS user. We have to use the wildcard “*” for the database because CloudFormation gives us no way to determine the RDS DbiResourceId! 🙁
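CloudFormation cannot resolve it for you, but if you later want to tighten that wildcard you can look up the resource ID out-of-band and update the policy by hand; a minimal sketch with boto3 (instance identifier and account ID are placeholders):

import boto3

# Look up the DbiResourceId needed for a fully-qualified rds-db ARN.
rds = boto3.client("rds", region_name="us-east-1")
instance = rds.describe_db_instances(DBInstanceIdentifier="my-rdsiam-instance")["DBInstances"][0]
resource_id = instance["DbiResourceId"]  # e.g. "db-ABCDEFGHIJKLMNOPQRSTUVWXY"
print(f"arn:aws:rds-db:us-east-1:123456789012:dbuser:{resource_id}/rdsiamuser")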
Set IAM auth for RDS user
After CloudFormation creates the EC2 and RDS, we can connect to the DB with the initial password we gave it, then grant the initial DBUser the rds_iam role so it can authenticate via IAM. Since my RDS was inside a private VPC, I found it easiest to launch a PostgreSQL docker container on the EC2 and run the commands there:
ec2# docker run -it postgres:alpine bash
docker# psql -h MyRdsDbHost.us-east-1.rds.amazonaws.com -U rdsiamuser postgres
postgres=> GRANT rds_iam TO rdsiamuser;
postgres=> \du
 Role name  |           Attributes           |        Member of
------------+--------------------------------+--------------------------
 rdsiamuser | Create role, Create DB        +| {rds_superuser,rds_iam}
            | Password valid until infinity  |
Now you can see the user has the rds_iam role, so it can use IAM to auth. Warning: this will prevent that user from logging in with normal password credentials! I didn’t see this mentioned in the AWS docs.
Of course the goal is to not have passwords in code, and above we show our CloudFormation, which likely has the password committed to the repo. But after GRANTing rds_iam, the password no longer works, so this security bug turns into a security feature. I see no way of setting this GRANT at RDS creation time, nor any other mechanism to enable IAM auth for a user purely through CloudFormation.
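If you want to avoid the manual psql step, a one-off bootstrap script run once after stack creation can do the same GRANT; a sketch using psycopg2, assuming the master password is still available to the script via an environment variable at that point:

import os
import psycopg2

# One-time bootstrap: the initial master password still works here, and stops
# working for this user as soon as the GRANT below takes effect.
conn = psycopg2.connect(
    host=os.environ["DATABASE_HOST"],
    dbname="postgres",
    user="rdsiamuser",
    password=os.environ["DATABASE_PASSWORD"],  # hypothetical env var holding the master password
    sslmode="require",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("GRANT rds_iam TO rdsiamuser;")
conn.close()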
Django config
I then used the package referenced in the post, https://github.com/labd/django-iam-dbauth, adding it to my requirements.txt. After reading the code and walking through it I realized the README docs were incomplete and we must supply a region for it to work. This is what I ended up with in my settings/dev.py file:
DATABASES = {
    "default": {
        "HOST": os.environ["DATABASE_HOST"],
        "NAME": os.environ["DATABASE_NAME"],
        "USER": os.environ["DATABASE_USER"],
        "ENGINE": "django_iam_dbauth.aws.postgresql",
        "OPTIONS": {
            "use_iam_auth": True,
            "region_name": "us-east-1",
        },
    }
}
Look, ma — no PASSWORD! The ENGINE and OPTIONS entries are the critical bits; the other info comes from the environment our Docker container runs in, which is pretty standard.
I could then run Django on my EC2 (we do it inside Docker) and it worked great for more than 15 minutes, so we knew the django-iam-dbauth worked without token timeouts. All was well… until it wasn’t.
Django Failures, rdsauthproxy Failures
After about an hour, I started seeing authentication failures in the Django logs. We’re running it in gunicorn and saw:
[2021-05-06 18:23:31 +0000] [52] [DEBUG] GET /
psycopg2.OperationalError: FATAL: PAM authentication failed for user "rdsiamuser"
FATAL: pg_hba.conf rejects connection for host "10.42.9.163", user "rdsiamuser", database "rdsiam", SSL off
and later, timeouts from gunicorn, presumably because Django could not auth to build a response:
[2021-05-06 19:44:35 +0000] [50] [CRITICAL] WORKER TIMEOUT (pid:63)
[2021-05-06 19:44:36 +0000] [50] [WARNING] Worker with pid 63 was terminated due to signal 9
This was a test instance with virtually no load except ALB health probes (GET /). RDS was on a db.t3.micro which had been fine for our developers exercising it when using typical password auth, so something broke when we switched to IAM. It seemed to recover after a while, briefly, then failed again and never did recover.
A look at the RDS Logs showed the cause of the problem:
* connect to 127.0.0.1 port 1108 failed: Connection refused
* Failed to connect to rdsauthproxy port 1108: Connection refused
* Closing connection 0
2021-05-06 20:35:10 UTC:10.42.9.163(35716):rdsiamuser@rdsiam:[31065]:LOG: pam_authenticate failed: Permission denied
2021-05-06 20:35:10 UTC:10.42.9.163(35716):rdsiamuser@rdsiam:[31065]:FATAL: PAM authentication failed for user "rdsiamuser"
2021-05-06 20:35:10 UTC:10.42.9.163(35716):rdsiamuser@rdsiam:[31065]:DETAIL: Connection matched pg_hba.conf line 13: "hostssl all +rds_iam all pam"
2021-05-06 20:35:10 UTC:10.42.9.163(35718):rdsiamuser@rdsiam:[31067]:FATAL: pg_hba.conf rejects connection for host "10.42.9.163", user "rdsiamuser", database "rdsiam", SSL off
It appears that RDS has a proxy for PostgreSQL IAM authentication called “rdsauthproxy”, but it died for some undisclosed reason. It may have come back once or twice but eventually went down permanently and IAM auth never worked again. I was able to stop and restart the RDS instance in the AWS console and the rdsauthproxy would come back, but it would soon go down again without a trace.
I found only one hit on this topic, from August 2020, with a “me too” from March 2021, and have posted my “me too” reply; zero response from AWS: https://forums.aws.amazon.com/thread.jspa?threadID=326681
AWS says too small, known problem
I filed a support ticket with AWS and they said that my t3.micro instance had a “CPU Baseline: 10%”, that CPU usage had been consistently above this, and that the T3 burstable “CPU Credit” balance dropped to 0, so it had used all its credits; Freeable Memory dropped very low and swap usage was high.
It sounded like I was thrashing the underlying T3 instance and exhausting resources, and that was probably what was killing rdsauthproxy. But why was it fine, under much higher developer load, when using Password auth instead of IAM auth?
AWS Support then went on to say:
while our internal team is indeed investigating further on this, since this issue is not really a bug and is mostly related to resource throttling, there might not really be a “fix” for it, as such. Therefore, we recommend all our customers to ensure that they have enough resources to have a seamless experience and avoid such scenarios.
We have no guidance on how much we have to scale up our RDS instance before it will stop dying: double? quadruple? a different instance type? Who knows… 🙁
db.t3.small seems to work — then falls over
I tweaked my CloudFormation to replace the micro instance with a db.t3.small and redeployed. Happily, it copied all the data and after a while the app started working again. Still, we’re not putting any load on this test instance. For our QA and Prod servers, we’ll have to watch load carefully; maybe use a db.m5.* instance instead of a burstable db.t3.* instance.
About 12 hours later, with zero app load, we saw auth failures and RDS load went from < 10% to about 30%. So this size is too small for anything with IAM auth.
Adding IAM auth doubles, quadruples, octuples cost
Our db.t3.micro ran for months for our devs and for demos to our customer; it never fell over and was snappy enough. It costs $0.018/hour, or about $13/month — totally reasonable for a Dev and maybe QA instance.
The db.t3.small fell over in 12 hours, and costs twice as much. A db.t3.medium costs twice that. So our dev instance is now costing 4x what it used to in order to support IAM auth, and we don’t know if it will fall over under anything but minimal load.
For QA, we’d need at least the same size, 4x the cost of the micro.
For Prod, we’d need 2x instances and probably an “m” class instance so we don’t run out of “t” burst credits. The minimum in that class is db.m6g.large at $0.159/hour (almost 10x the cost of our micro), and we need 2 for failover, about $230/month. That’s a lot of money for our fairly-small commercial app’s database, especially since if we used the same size in Dev and QA (with only one instance each), it adds another $230/month: roughly $500/month for a small app with Dev, QA, and Prod. Not a way to win customers.
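The back-of-the-envelope arithmetic, at the on-demand prices quoted above and roughly 730 hours per month:

# Rough monthly cost comparison (on-demand prices quoted above).
HOURS_PER_MONTH = 730

micro = 0.018 * HOURS_PER_MONTH          # db.t3.micro, ~ $13/month
small = micro * 2                        # db.t3.small
medium = micro * 4                       # db.t3.medium
prod_pair = 0.159 * HOURS_PER_MONTH * 2  # 2x db.m6g.large for failover, ~ $232/month

print(f"micro ${micro:.0f}, small ${small:.0f}, medium ${medium:.0f}, prod pair ${prod_pair:.0f}")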
The point is that the overhead of running IAM auth is increasing our DB cost by roughly 4x to 10x compared to password-based auth!
Could we switch to Aurora? Let’s presume it can do IAM auth without falling over. If we go to the pricing calculator and pick the lowest-priced option (db.r4.2xlarge), the price is $847! OK, that option’s out.
Maybe if our DB needs were big and we already required a db.t3.xlarge or db.m6g.large, the IAM penalty wouldn’t be noticeable, but it’s noncompetitive for our use case.
So you’ve GRANTed rds_iam to your PostgreSQL user and locked yourself out of your database. Now what?
You can use the AWS console to disable IAM authentication, wait for the instance to reconfigure, and try again later.
Or you can use the technique from https://aws.amazon.com/premiumsupport/knowledge-center/rds-postgresql-connect-using-iam/ to get a 15-minute password token, just like our Django plugin does via Boto3 calls.
On the EC2 with access to the RDS, use the “aws” CLI to generate a token. If this isn’t installed on your EC2, you can use their Docker image. Generate an auth token:
export PGPASSWORD="$(aws rds generate-db-auth-token --hostname $RDSHOST --port 5432 --region us-east-1 --username rdsiamuser)"
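The same token can be generated from Python with boto3 (essentially what django-iam-dbauth does under the hood; the hostname here is a placeholder):

import boto3

# Generate a short-lived (15-minute) token usable as the password for psql/psycopg2.
rds = boto3.client("rds", region_name="us-east-1")
token = rds.generate_db_auth_token(
    DBHostname="MyRdsDbHost.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    Port=5432,
    DBUsername="rdsiamuser",
)
print(token)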
Then, connect using a Dockerized Postgres client on the EC2:
docker run -it \
  -e PGPASSWORD=$PGPASSWORD \
  postgres:alpine \
  psql -h $RDSHOST -p 5432 \
  "sslmode=require dbname=postgres user=rdsiamuser"
This takes about 8 seconds on the first connection but it’s quick on subsequent connections; maybe the rdsauthproxy has a cache. Then you should be able to create another user with normal password creds or perhaps revoke the IAM creds to restore password login:
postgres=> REVOKE rds_iam FROM rdsiamuser;
Then you should be able to log in with a password and do whatever else you need.