Apparently I started this blog post in January of 2022. In November 2022 I discovered that I hadn’t posted a single blog post all year, and this was in danger of not getting published in the same year I wrote it.
S3 and CloudFront are awesome for blogging. I use Hugo for this web site and almost every other web site I create. There are lots of tutorials like this one that will tell you how to do the basics. But there’s one persistent, annoying problem with static web sites and the S3/CloudFront solution.
The web began by just serving files out of file systems. If you had a URL like http://www.example.com/photos/cats/cat1.jpg, that almost certainly meant that somewhere on a server there was a directory photos, that directory had a subdirectory cats, and in that directory you’d find a file cat1.jpg. But what if we request the directory itself, like http://www.example.com/photos/cats/? That represented a directory full of files, and we want to see an index of what’s in that directory. Hence index.html was born: it became the file you display when someone requests a directory on your web server. If the URL http://www.example.com/photos/cats/cat1.jpg maps to the file /www/html/photos/cats/cat1.jpg on my server, then I expect the URL http://www.example.com/photos/cats/ to map to /www/html/photos/cats/index.html.
CloudFront STILL doesn’t do this
With a CloudFront/S3 blog, I might have an object whose address is s3://blogbucket/photos/cats/cat1.jpg. I stick a CloudFront distribution in front of it, and it becomes possible to access it with a URL like https://blog.example.com/photos/cats/cat1.jpg. That’s nice, but what happens if I access a URL like https://blog.example.com/photos/cats/? Nothing. You get a 404 Not Found error, because there is no object with that name.
It’s pretty disappointing that, after all these years, CloudFront still doesn’t reproduce one of the most common web URL behaviours. The S3/CloudFront use case is compelling and has been around for a really long time, but it STILL doesn’t solve this problem. CloudFront will let you specify a default root object (the index.html at the root of your web site), but it won’t do the same for any directory other than the root.
A common solution, but not a great one, is this AWS blog from 2017. It isn’t great because it runs code to resolve the URL and do the redirection. That code (a) costs money to run, and (b) can break through no fault of your own. I launched that code with whatever version of NodeJS was current at the time and forgot about it. Years later, Amazon deprecated that version of NodeJS and I had to go back and update my Lambda function to keep it working. That’s needless complexity when all you’re trying to do is serve a totally normal URL.

It does work, though, and it scales. The Lambda function is only called once per URL: if your web site gets hammered with requests, CloudFront caches the response from the Lambda function and doesn’t need to invoke it over and over. But the entire reason people run static web sites is so that they don’t have code running their web site! Running 20 lines of NodeJS just to handle default path objects is still running code, and if that code goes offline (e.g., because it’s built on a deprecated runtime), my blog goes offline. That’s not what I want.
A Better Solution
I found Luke Plausin’s blog, and it is a clever solution. By using the S3 API (the s3api command), you can actually create an object at the directory marker and write HTML to it. There are a lot of advantages to this and a handful of disadvantages. The primary advantage, in my mind, is that I write an object once and totally forget about it. There’s no code, ever. I can go a year without touching anything (like I did with this blog post!) and nothing breaks on its own.
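The core of the trick can be sketched with a single call. The bucket name is a placeholder (reusing the cats example from above), not my real setup:

```shell
bucket="blogbucket"   # placeholder bucket name
key="photos/cats/"

# S3 happily stores an object whose key ends in "/". Once it
# exists, CloudFront serves it for the directory-style URL
# https://blog.example.com/photos/cats/ with no redirect and
# no Lambda involved.
aws s3api put-object \
  --bucket "$bucket" \
  --key "$key" \
  --body public/photos/cats/index.html \
  --content-type "text/html"
```

The --content-type matters: without it, S3 stores the object as binary/octet-stream and browsers may download the page instead of rendering it.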
When you host a static web site generated by Hugo in an S3 bucket, the general idea is to run aws s3 sync public/ s3://blogbucket/ and let the magic happen. You quickly realise that it’s more complicated than that, though. You’ll need a script to deploy it right.
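The basic sync might look like this sketch. The flags are illustrative of what a deploy script ends up needing, not necessarily what mine uses:

```shell
dest="s3://blogbucket/"   # placeholder bucket

# Push the generated site, deleting anything removed locally.
# The ACL and cache lifetime are illustrative values.
aws s3 sync public/ "$dest" \
  --delete \
  --acl public-read \
  --cache-control "max-age=86400"
```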
To integrate Luke’s solution, I have a final loop that finds all the index.html files and writes each one 2 more times. So if I had /photos/cats/index.html, then I also write /photos/cats/ and /photos/cats, both with the same HTML that’s in /photos/cats/index.html.
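That final loop can be sketched roughly as follows. The bucket name and cache header are placeholders, and the no-trailing-slash copy is my reading of “2 more times”:

```shell
bucket="blogbucket"   # placeholder bucket name

# Find every index.html that Hugo generated and write it two
# more times: once at the key with a trailing slash and once
# without, so both URL forms serve the page.
find public -name index.html | while read -r src; do
  key="${src#public/}"           # e.g. photos/cats/index.html
  dir_slash="${key%index.html}"  # e.g. photos/cats/
  dir_plain="${dir_slash%/}"     # e.g. photos/cats
  for k in "$dir_slash" "$dir_plain"; do
    # The top-level index.html yields empty keys here; the
    # distribution's default root object covers that one.
    [ -n "$k" ] || continue
    aws s3api put-object \
      --bucket "$bucket" \
      --key "$k" \
      --body "$src" \
      --content-type "text/html" \
      --cache-control "max-age=86400"
  done
done
```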
Pay attention to some other things I do in there, like making the S3 objects public and setting their MIME types and cache parameters.
The other thing I do in there is set up robots.txt and the right RSS feed URL.
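Those one-off objects might be written like this. I don’t know exactly which feed URL is mapped, so the rss.xml alias (pointing at Hugo’s index.xml) and the bucket name are assumptions:

```shell
bucket="blogbucket"   # placeholder bucket name

# robots.txt must be served as plain text at the site root.
aws s3api put-object --bucket "$bucket" --key "robots.txt" \
  --body public/robots.txt --content-type "text/plain"

# Hugo writes the RSS feed as index.xml; feed readers often
# probe /rss.xml, so publish the same document there too.
aws s3api put-object --bucket "$bucket" --key "rss.xml" \
  --body public/index.xml --content-type "application/rss+xml"
```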
The Main Problems
The main problem with this approach is that (at the time of this writing) I have 906 pages in my blog. Hugo itself creates a lot of static index pages, like tags and categories, and it makes index.html files in each of those directories. My public folder has 662 index.html files in it, meaning I have to make 1986 s3api calls. Moreover, when I update the blog, basically ALL of the tag and category pages update, so I must update those S3 objects too. I can’t ignore this.
If you look at my sync.sh script, you’ll see that I do a couple of things to optimise, but I don’t think there’s any getting away from 1900+ S3 API calls. These days, my sync.sh script takes something like 35 minutes to run, and I find that ridiculous. There are 1499 files in my public directory on my laptop (the files I sync to S3), but my bucket has 2738 objects in it because of all this duplication.
In an effort to be a bit efficient, I fetch the AWS credentials once and then reuse them over and over. For my use case, this is fine, but my AssumeRole issues credentials valid for only 60 minutes, and at the moment deploying my blog takes 35 minutes. If it gets much longer than that, I will have to handle the expiry errors or make the credentials last longer. Yes, I could run this in a code pipeline or on an EC2 instance, but that’s more expensive than letting it run on my laptop.
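Fetching the credentials once up front might look like this sketch. The role ARN and session name are placeholders, and the 3600-second duration matches the 60-minute limit mentioned above:

```shell
# Assume the deploy role once; the ARN here is a placeholder.
creds=$(aws sts assume-role \
  --role-arn "arn:aws:iam::123456789012:role/blog-deploy" \
  --role-session-name "blog-sync" \
  --duration-seconds 3600 \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)

# The three fields come back tab-separated; export them so every
# later aws call in the script reuses the same session instead
# of assuming the role again.
export AWS_ACCESS_KEY_ID=$(echo "$creds" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$creds" | cut -f3)
```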
So this is a bit of a clunky solution. Publishing a change to my blog, even a small one, takes 35 minutes to deploy. And then the cache times I’ve chosen cause pages to persist in their not-updated state for a pretty long time (up to a day). I could use CloudFront cache invalidations to either wholesale invalidate my whole global cache, or try to figure out what has just been published that’s different and only invalidate that.
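The targeted version of that invalidation could be sketched like this; the distribution ID and paths are placeholders for whatever just changed:

```shell
dist_id="E123EXAMPLEID"   # placeholder distribution ID

# Invalidate only the pages that changed. Listing both the
# directory-style path and its index.html covers both URL
# forms; a single "/*" path would instead wipe the whole cache.
aws cloudfront create-invalidation \
  --distribution-id "$dist_id" \
  --paths "/photos/cats/" "/photos/cats/index.html" "/index.xml"
```

Targeted invalidations stay cheap, too: the first 1,000 invalidation paths per month are free.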
Whenever I have some clunky solution that I just live with, I think there must be a better way. Feel free to communicate with me on Mastodon (below) if you have a better way to get past some of these things.
"Diagonal arrow made from puzzle pieces" by Horia Varlan is licensed under CC BY 2.0.