Configuration updates during disruption and outage

Imagine this: you’re running a reasonably important site, and you decided to deploy LibResilient to make it, well, resilient.

Your config sets up the fetch plugin, then local cache, and then alt-fetch with some nice independent endpoints (say, an IPFS gateway here, a Tor Onion gateway there). Perhaps with some content-integrity checking deployed too, for peace of mind (no need to completely trust those third-party gateway operators, after all).

Obviously you run your own IPFS node and a Tor Hidden Service for these to work correctly, but your website visitors do not need any special software, extensions, or configuration — they just visit your site in their regular browser; LibResilient handles everything else behind the scenes.

Then you experience an outage

Maybe your server keeled over, or maybe it’s a DDoS.

The good news is: for all visitors who had visited your site before, everything seems to work just fine (if perhaps a tiny bit slower than usual). Your website content is cached on IPFS nodes, and IPFS gateways are happily serving the requests LibResilient sends their way. The local cache makes the experience quite seamless for content those visitors have viewed before, and the IPFS-related slowdown for content they have not is still small.

For whatever reason, however, you figure out the outage will last a while longer, and you’d like to swap out the fetch plugin completely (no reason for visitors to wait on something that isn’t going to work for the time being). You’d probably also want to remove the Tor Onion gateway from the alt-fetch endpoints; after all, your Tor Hidden Service is down as well.

The bad news is: LibResilient’s config is a JavaScript file imported directly into the Service Worker, so you have no way to update it until your site comes back up.

But now you actually do!

This is what this milestone (third one supported by a small grant from NGI Assure) was all about.

To make config updates possible during disruption and outage, the config format needed to be changed (JSON was the obvious choice), and then the whole machinery of verifying, loading, and caching it needed to be implemented.

And so now, the config file (config.json) is just a regular file that can be retrieved via any configured plugins. You don’t have to do anything special for this to work.
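
For illustration, a config.json along these lines would describe the setup from the scenario above. The field names and endpoint URLs here are an illustrative sketch, not necessarily LibResilient’s exact schema:

```json
{
  "plugins": [
    { "name": "fetch" },
    { "name": "cache" },
    {
      "name": "alt-fetch",
      "endpoints": [
        "https://ipfs-gateway.example.com/",
        "https://onion-gateway.example.net/"
      ]
    }
  ]
}
```

Because this is a plain static file, updating the configuration is just replacing config.json wherever your content is published, including on the alternative transports.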

Let’s dive deeper into what exactly has been done this month.

1. Switching to a JSON config file (instead of JS)

This was done because the Service Workers API does not provide any way to update JavaScript scripts imported into the Service Worker via an importScripts() call, other than via a direct HTTPS fetch() to the original website.

For obvious reasons that’s unworkable for updating the config during disruption/outage.

A bunch of research was required, as expected. In the end, LibResilient needed a rough re-implementation of what the browser does for scripts imported via importScripts(): fetching config.json, caching it, and deciding when the cached copy is stale and needs to be re-fetched.
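
The core of that machinery can be sketched roughly like this. This is not LibResilient’s actual code; fetchFn and cache are stand-ins for the Service Worker’s fetch() and a Cache API instance, injected here so the sketch is self-contained:

```javascript
// Sketch: try to fetch a fresh config.json, cache a successful
// response for later, and fall back to the cached copy on failure.
async function getConfig(fetchFn, cache) {
  try {
    const fresh = await fetchFn('/config.json');
    if (fresh.ok) {
      // Keep a copy around for the next start-up, or for an outage.
      await cache.put('/config.json', fresh.clone());
      return await fresh.json();
    }
  } catch (e) {
    // Network failed entirely; fall through to the cached copy.
  }
  const cached = await cache.match('/config.json');
  if (cached) {
    return await cached.json();
  }
  throw new Error('no config.json available, fresh or cached');
}
```

In a real Service Worker you would pass in the global fetch and a cache opened via caches.open(); the point is only the order of operations: fresh fetch first, cache as the safety net.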

An additional benefit is that the config file is now not code but an “inert” format (JSON). It is no longer possible to include running code directly in the configuration file. This is important for various reasons that the LANGSEC community explores at length.

This work also included implementing validity checks on the config file — something that was not really possible when config was written in directly-loaded JavaScript.
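
To give a flavor of what such checks look like, here is a minimal shape-check of the kind that becomes possible once the config is inert JSON. The field names (“plugins”, each with a “name”) are illustrative assumptions, not the exact checks LibResilient performs:

```javascript
// Sketch: parse raw config text and verify its basic shape before use.
function validateConfig(raw) {
  let config;
  try {
    config = JSON.parse(raw);
  } catch (e) {
    return { valid: false, reason: 'not valid JSON' };
  }
  if (typeof config !== 'object' || config === null || Array.isArray(config)) {
    return { valid: false, reason: 'config must be a JSON object' };
  }
  if (!Array.isArray(config.plugins) || config.plugins.length === 0) {
    return { valid: false, reason: 'config must list at least one plugin' };
  }
  for (const plugin of config.plugins) {
    if (typeof plugin.name !== 'string' || plugin.name === '') {
      return { valid: false, reason: 'each plugin entry needs a name' };
    }
  }
  return { valid: true };
}
```

With directly-loaded JavaScript config, the only “validation” was whether the script threw on import; with JSON, a bad config can be rejected cleanly before it replaces a working one.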

Implementing this change required me finally diving deep into the ServiceWorker lifecycle, especially the parts of it that are mostly glossed over or not mentioned at all in most documentation: what exactly happens when a ServiceWorker has been registered and installed, but is now stopped, and is being restarted?

This research was crucial to implementing the JSON config change correctly, and provided important insight that will potentially be very useful for implementing future improvements.

2. Implementing a way for config to be updated and applied during disruption

Once the JSON config change was implemented, it was possible to implement background fetching of the updated config.json file.

This required cleaning up and refactoring the code implementing JSON config support, and deciding on the criteria for considering a cached config.json “stale” (currently: over 24 hours old, based on the Date: header of the cached response).
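
The staleness check described above can be sketched as follows. The function name and the handling of missing or malformed headers are illustrative, not LibResilient’s actual implementation, but the criterion (Date: header more than 24 hours old) is the one described:

```javascript
const MAX_CONFIG_AGE_MS = 24 * 60 * 60 * 1000; // 24 hours

// Sketch: a cached config response is stale once its Date: header
// is more than 24h in the past.
function isConfigStale(cachedResponse, now = Date.now()) {
  const dateHeader = cachedResponse.headers.get('date');
  if (dateHeader === null) {
    // No Date: header at all; safest to treat the cached config as stale.
    return true;
  }
  const cachedAt = Date.parse(dateHeader);
  if (Number.isNaN(cachedAt)) {
    // Unparseable Date: header; same reasoning.
    return true;
  }
  return (now - cachedAt) > MAX_CONFIG_AGE_MS;
}
```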

The biggest issue was figuring out what should happen if the freshly retrieved config file configures plugins that have not been loaded upon Service Worker installation. Because an updated config.json file is processed after Service Worker restart (so, not during installation), importScripts() is not available.

A decision was made to test for such config changes, and to reject the updated config file outright (falling back to the already-cached, if stale, config.json) whenever it was not retrieved using a regular fetch.

The rationale for this is that in such circumstances:

  1. the original website cannot be assumed to be working correctly, as the config.json file was retrieved using an alternative transport;
  2. the currently deployed (even if stale) config.json is demonstrably functional, since it is what allowed us to retrieve the updated config.json in the first place.
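
That decision rule can be sketched as a single predicate. The names here are illustrative, not LibResilient’s API; the logic is the one described above:

```javascript
// Sketch: accept an updated config only if every plugin it names was
// already loaded at Service Worker install time, or if the config
// arrived via a plain fetch (meaning the origin server is back, and a
// normal Service Worker update can load any new plugin scripts).
function shouldAcceptUpdatedConfig(updatedConfig, loadedPluginNames, fetchedViaRegularFetch) {
  if (fetchedViaRegularFetch) {
    return true;
  }
  // Over an alternative transport, importScripts() is unavailable, so
  // any plugin not already loaded simply cannot be used.
  return updatedConfig.plugins.every(
    (plugin) => loadedPluginNames.includes(plugin.name)
  );
}
```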

Ideas for potential further improvements to this are listed here.

3. Documentation

Documentation was written on how JSON config loading and updating works, and how the config can be updated during disruption or outage. It explains the rationale behind implementation decisions and their technological context.

There is obviously more work needed to make this documentation more useful and readable. But it’s a start.

Maintaining quality of code

Code written for this milestone is of course covered by tests; overall test coverage went up to ~62%.

As before, I have avoided any external dependencies whatsoever. LibResilient remains easily deployable by simply copying a few JS files (and now a single JSON file) and adding a single line to your HTML.

And the next milestone is…

There are four milestones on the todo list. Unclear which one I will focus on next, but that should be resolved soon. Keep an eye on the issues assigned to those milestones if you want to be the first to know!