Serving Web Directories using S3 and CloudFront

Apparently I started this blog post in January of 2022. In November 2022 I discovered that I hadn’t posted a single blog post all year, and this was in danger of not getting published in the same year I wrote it.

S3 and CloudFront are awesome for blogging. I use Hugo for this web site and almost any other web site I create. There are lots of tutorials like this one that will tell you how to do the basics. But there’s this one persistent, annoying problem with static web sites and the S3/CloudFront solution.

The web began by just serving files out of file systems. If you had a URL like http://www.example.com/photos/cats/cat1.jpg, that almost certainly meant that somewhere there was a directory photos on a server, that directory had a subdirectory cats, and in that directory you’d find a file cat1.jpg. But what if we go to the directory itself? Like http://www.example.com/photos/cats/? That represented a directory full of files, and we want to see an index of what’s in that directory. Hence index.html was born, and it became the file you display when someone requests a directory on your web server. If the URL http://www.example.com/photos/cats/cat1.jpg maps to the file /www/html/photos/cats/cat1.jpg on my server, then I expect the URL http://www.example.com/photos/cats/ to map to /www/html/photos/cats/index.html.

CloudFront STILL doesn’t do this

With a CloudFront/S3 blog, I might have an object whose address is s3://blogbucket/photos/cats/cat1.jpg. I stick a CloudFront distribution in front of it and it becomes possible to access it with a URL like https://blog.example.com/photos/cats/cat1.jpg. That’s nice, but what happens if I access a URL like https://blog.example.com/photos/cats/? Nothing. You typically get a 404 Not Found error, because there is no object with that name.

It’s pretty disappointing that, after all these years, CloudFront still doesn’t reproduce one of the most common web URL behaviours. The S3/CloudFront use case is so compelling and has been around for a really long time, but it STILL doesn’t solve this problem. CloudFront will let you specify a default root object (i.e., the index.html served at the root of your web site), but it won’t do this for anything other than the root.

A common solution, that isn’t great, is this AWS blog from 2017. This solution isn’t great because it’s running code to resolve the URL and do the redirection. This code (a) costs money to run, and (b) can break through no fault of your own. I launched that code with some version of NodeJS that was current at the time and forgot about it. Years later, Amazon deprecated that version of NodeJS and I had to go back and update my Lambda function to keep it working. Needless complexity when all you’re trying to do is serve a totally normal URL. It does work, though: it scales. The Lambda function is only called once per URL. If your web site gets hammered with requests, CloudFront caches the response from the Lambda function and doesn’t need to invoke it many times. But the entire reason people run static web sites is so that they don’t have code running their web site! And running 20 lines of NodeJS just to handle default path objects is still running code. And if that code goes offline (e.g., because it’s built on a deprecated runtime), my blog goes offline. That’s not what I want.

A Better Solution: s3api

I found a clever solution on Luke Plausin’s blog. By using the S3 API (the s3api command), you can actually create an object at the directory marker and write HTML to it. There are a lot of advantages to this and a handful of disadvantages. The primary advantage, in my mind, is that I write an object once and totally forget about it. There’s no code, ever. I can go a year without messing with anything (like I did for this blog post!) and nothing breaks on its own.
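The trick, roughly, is a single s3api put-object call per directory. Here’s a minimal sketch using the blogbucket example from above and a hypothetical photos/cats/ directory (this is not an excerpt from Luke’s post or from my script):

    # Write the directory's index HTML directly to a key that looks like a directory.
    aws s3api put-object \
      --bucket blogbucket \
      --key "photos/cats/" \
      --body public/photos/cats/index.html \
      --content-type "text/html"

The higher-level aws s3 commands treat a trailing slash as a prefix, but the lower-level s3api put-object will write to whatever key you give it. The explicit --content-type matters, because put-object otherwise stores the object as binary/octet-stream.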

My Solution: sync.sh

When you host a static web site generated by Hugo in an S3 bucket, the general idea is to run aws s3 sync public/ s3://blogbucket/ and let the magic happen. You quickly realise that it’s more complicated than that, though. You’ll need a script to deploy it right.

For one thing, I have to handle one-offs: my blog has been online, in one form or another, since the 1990s. I have some URLs that are somehow still out there and getting hundreds of legitimate hits every month. I put some explicit one-offs in my synchronisation script to create objects that return the right content at those URLs. Also, the key to using CloudFront well is caching, and the key to doing caching well is setting metadata on your objects. My synchronisation script sets really long cache times (1 day) on things like CSS files, images, and JavaScript. I put shorter cache times (1 hour) on the blog posts themselves.
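In spirit, the metadata part is two sync passes with include/exclude patterns, something like this (the patterns here are illustrative, not the exact ones in my script):

    # Static assets change rarely: cache for a day.
    aws s3 sync public/ s3://blogbucket/ \
      --exclude "*" \
      --include "*.css" --include "*.js" \
      --include "*.png" --include "*.jpg" \
      --cache-control "max-age=86400,public"

    # The HTML pages themselves change more often: cache for an hour.
    aws s3 sync public/ s3://blogbucket/ \
      --exclude "*" --include "*.html" \
      --cache-control "max-age=3600,public"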

To integrate Luke’s solution, I have a final loop that finds all the index.html files and writes each one two more times. So if I have /photos/cats/index.html, then I also write /photos/cats and /photos/cats/ with the same HTML that’s in index.html.
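That loop is something along these lines (again a sketch, assuming the Hugo output lives in public/ and the bucket is blogbucket):

    # For every index.html Hugo generated, write the same HTML two more times:
    # once at the bare directory key and once at the trailing-slash key.
    find public -name "index.html" | while read -r file; do
      key="${file#public/}"            # e.g. photos/cats/index.html
      dir="$(dirname "$key")"          # e.g. photos/cats
      [ "$dir" = "." ] && continue     # the site root is handled separately
      for target in "$dir" "$dir/"; do
        aws s3api put-object \
          --bucket blogbucket \
          --key "$target" \
          --body "$file" \
          --acl public-read \
          --content-type "text/html" \
          --cache-control "max-age=3600,public"
      done
    done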

Pay attention to some other things I do in there, like making S3 objects public and setting their MIME types and cache parameters (e.g., --cache-control "max-age=86400,public").

The other thing I do is set up objects for a few special URLs, like robots.txt and the right RSS feed URL.
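For illustration only (the real keys in my script are tied to this blog’s history), one of those one-offs looks roughly like this:

    # robots.txt should be served as plain text, with a long cache time.
    aws s3api put-object \
      --bucket blogbucket \
      --key "robots.txt" \
      --body public/robots.txt \
      --acl public-read \
      --content-type "text/plain" \
      --cache-control "max-age=86400,public"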

The Main Problems

The main problem with this approach is that (at the time of this writing) I have 906 pages in my blog. Hugo itself creates a lot of static index files, like tags and categories, and it makes index.html files in each of those directories. My public folder has 662 index.html files in it, meaning I have to make 1986 s3api calls. Moreover, when I update the blog, basically ALL of the tags and categories pages change, so I have to update those S3 objects. I can’t ignore this.

If you look at my sync.sh script, you’ll see that I do a couple of things to optimise, but I don’t think there’s any getting away from 1900+ S3 API calls. These days, my sync.sh script takes like 35 minutes to run, and I find that ridiculous. There are 1499 files in my public directory on my laptop (that I sync to S3), but my bucket has 2738 objects in it because of all this duplication.

In an effort to be a bit more efficient, I fetch the AWS credentials once and then reuse them over and over. For my use case, this is fine. But I allow an AssumeRole to issue credentials valid for only 60 minutes, and at the moment, deploying my blog takes 35 minutes. If it gets longer than that, I will have to do something to handle the errors or make the credentials last longer. Yes, I could run this in a code pipeline or on an EC2 instance, but that’s more expensive than letting it run on my laptop.
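The credential fetch is a single assume-role call up front, roughly like this (the role ARN and session name below are placeholders):

    # Assume the deployment role once and export the temporary credentials
    # so every aws call in the rest of the script reuses them.
    creds=$(aws sts assume-role \
      --role-arn arn:aws:iam::123456789012:role/blog-deploy \
      --role-session-name blog-sync \
      --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
      --output text)

    export AWS_ACCESS_KEY_ID=$(echo "$creds" | cut -f1)
    export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | cut -f2)
    export AWS_SESSION_TOKEN=$(echo "$creds" | cut -f3)

If the deploy ever outgrows the 60-minute window, passing --duration-seconds (up to the role’s configured maximum session duration) is one way to make the credentials last longer.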

Conclusion

So this is a bit of a clunky solution. Publishing a change to my blog, even a small one, takes 35 minutes to deploy. And then the cache times I’ve chosen cause pages to persist in their stale state for a pretty long time (up to a day). I could use CloudFront cache invalidations to either wholesale invalidate my whole global cache, or try to figure out what has just been published that’s different and only invalidate that.
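For what it’s worth, the wholesale version of that is a single call (the distribution ID here is made up):

    # Invalidate every cached path in the distribution.
    # The first 1,000 invalidation paths per month are free; beyond that they cost money.
    aws cloudfront create-invalidation \
      --distribution-id E1234EXAMPLE \
      --paths "/*"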

Whenever I have some clunky solution that I just live with, I think there must be a better way. Feel free to communicate with me on Mastodon (below) if you have a better way to get past some of these things.

"Diagonal arrow made from puzzle pieces" by Horia Varlan is licensed under CC BY 2.0.

Comments (via Mastodon)