Attempting to deploy 0.149.2 brought down hyperbo.la and caused all requests to return a 502 status code. The root cause was a misconfigured PrivateLink endpoint combined with an unsafe ASG cycler script.
Since 2014, hyperbo.la secrets had been distributed via flat files with each deployment. .env files were necessary when hyperbo.la was a single Linode VPS, but AWS has better tools for distributing secrets.
GH-100 converted settings.py to fetch secrets from SSM Parameter Store.
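To make the idea of reading secrets from SSM Parameter Store concrete, here is a minimal sketch of a helper a settings module could use. It is not the code from GH-100: the function name load_secret, the /example/app/ parameter prefix, and the environment-variable override are all assumptions made for illustration. The only AWS call used is boto3's SSM get_parameter with WithDecryption.

```python
import os


def load_secret(name, ssm_client=None, prefix="/example/app/"):
    """Return a secret from SSM Parameter Store, preferring an env override.

    The prefix and the override behavior are hypothetical, not taken from
    the real settings.py.
    """
    # An environment variable wins, so local development needs no AWS access.
    override = os.environ.get(name)
    if override is not None:
        return override
    if ssm_client is None:
        # Imported lazily so this module still loads where boto3 is absent.
        import boto3

        ssm_client = boto3.client("ssm")
    # WithDecryption=True asks SSM to decrypt SecureString parameters.
    response = ssm_client.get_parameter(Name=prefix + name, WithDecryption=True)
    return response["Parameter"]["Value"]


# Possible use inside a settings module (hypothetical parameter name):
# SECRET_KEY = load_secret("SECRET_KEY")
```

Because the lookup happens when the settings module is imported, any command that imports settings, including one run during an image build, needs a working network path to SSM.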
App backend instances run in private VPC subnets with no Internet egress. Packer builds run in a single public VPC subnet with a public IP.
I recently refactored terraform config to use a launch template instead of a launch configuration. Deploys are now managed by an untested ASG cycler script.
0.149.0 includes a new task runner with a new and untested deploy task, inv deploy, and a rollback task, inv deploy.rollback.
inv deploy kicks off Packer build.
inv deploy finishes cycling the ASG.
https://hyperbo.la/ shows nginx 502 page.
inv deploy.rollback.
inv deploy.rollback finishes cycling the ASG.
https://hyperbo.la/ shows we are recovered.
https://hyperbo.la/ shows nginx 502 page.
inv deploy.rollback.
inv deploy.rollback finishes cycling the ASG.
https://hyperbo.la/ shows we are recovered.
https://hyperbo.la/ shows deploy succeeds.
https://hyperbo.la/healthz confirms 0.149.2 is deployed.
At this point the incident is mitigated but hyperbo.la is undeployable.
inv deploy kicks off Packer build.
inv deploy.ami fails when running manage.py collectstatic.
At this point I spent time to deep dive into VPC terraform to properly configure the SSM PrivateLink endpoint.
inv deploy kicks off Packer build.
inv deploy finishes cycling the ASG.
https://hyperbo.la/ shows deploy succeeds.
At this point, the incident is over and we are stable.
For most AWS features, the terraform documentation is descriptive enough that implementation is straightforward. PrivateLink was non-trivial to set up. Errors I made:
The ALB already hits /healthz. Automate checking that hosts come up cleanly so that deploys do not rely on manual smoke testing. Automatically detach unhealthy hosts from the ALB and halt the rollout.
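The remediation above can be sketched as a health-gated rollout loop. This is an illustration under stated assumptions, not the actual cycler script: gated_rollout, wait_until_healthy, and the bring_up and detach callbacks are hypothetical names, with the callbacks standing in for whatever ASG and ALB calls the real script makes. It also assumes each host answers /healthz over plain HTTP at its own address and that HTTP 200 means healthy.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and 4xx/5xx responses all count as unhealthy.
        return False


def wait_until_healthy(url, attempts=10, delay=3.0, check=is_healthy, sleep=time.sleep):
    """Poll the URL up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def gated_rollout(hosts, bring_up, detach, **wait_kwargs):
    """Bring hosts up one at a time; detach the first unhealthy host and halt.

    Returns (promoted_hosts, failed_host); failed_host is None on success.
    """
    promoted = []
    for host in hosts:
        bring_up(host)
        if not wait_until_healthy(f"http://{host}/healthz", **wait_kwargs):
            detach(host)
            return promoted, host  # halt: remaining hosts are never touched
        promoted.append(host)
    return promoted, None
```

Stopping at the first failure keeps the remaining old hosts in service, so a bad release degrades capacity instead of replacing every healthy host with a broken one.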
cycle_asg script.