We recently released v1.7.0 of Jidoteki Meta, and with it came an unwelcome surprise: it bricked a few customer appliances.
In this post, I’ll explain how not to brick a customer appliance, as well as a few techniques to recover when it happens.
Oops!
Our build process is entirely automated - duh, that’s why it’s called Jidoteki (automatically, in Japanese) - but our release process isn’t. There’s a series of roughly 40 steps on a long checklist prior to making a Jidometa release. Most tasks are automated and only take a few seconds to complete/validate, but others are manual and require a bit more time.
In our most recent release, we changed something which shouldn’t have been changed (the default /etc/fstab file), which prevented the /data disk from being mounted - but only on older deployments.
Problems compounded
To make matters worse, our fstab blunder had the cascading effect of preventing the OpenSSH, Nginx, and Jidoteki API services from starting, which made remote administration impossible (and, with it, any way to update the appliance).
A missing validation
What we didn’t realize was that our release process didn’t include a step to validate our builds/updates against, you guessed it, older deployments! We did test the updates against slightly older appliances, but not against the oldest ones - the ones which some of our great customers were still running.
No worries though, we’ve added that to our process/checklist and can guarantee it won’t happen again.
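To give an idea of what such a check can look like, here is a minimal Python sketch (not our actual checklist step). It compares the fstab shipped in an update against the devices known to exist on an older deployment. The device names, the fstab contents, and the function names are all made up for illustration.

```python
def parse_fstab(text):
    """Return (device, mount_point) pairs from fstab-formatted text."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def unmountable_entries(fstab_text, available_devices):
    """List fstab entries whose device is not present on the target."""
    pseudo = {"proc", "sysfs", "tmpfs", "devpts", "none"}  # no backing disk
    return [
        (device, mount_point)
        for device, mount_point in parse_fstab(fstab_text)
        if device not in pseudo and device not in available_devices
    ]


# Hypothetical fstab from a new release (assumed contents).
NEW_FSTAB = """\
proc       /proc  proc  defaults  0 0
/dev/sda1  /boot  ext4  defaults  0 2
/dev/sdb1  /data  ext4  defaults  0 2
"""

# Hypothetical disk layout of one of the oldest deployments.
OLD_APPLIANCE_DEVICES = {"/dev/sda1", "/dev/sda2"}

problems = unmountable_entries(NEW_FSTAB, OLD_APPLIANCE_DEVICES)
print(problems)
```

Running the same check once per supported appliance generation, with each generation's real device list, is the part we were missing.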
Recovery
Luckily (and stupidly), our initial appliances shipped with a default root password (known only to us). We were able to provide instructions to log in via the Linux console, start the necessary services, and then access the UI to upload an update package containing the fstab fix.
In our latest appliances, which are slightly more secure, there is no default root password anymore. In fact, logins are completely disabled, even by SSH. We provide the ability to change the admin password via the console GUI, but that only provides access to the files in /data (customer files), not root.
Recovery without password
We’ve customized the boot menu to provide just enough time to modify the boot command. Simply removing the ,/boot/rootfs entry will load the default TinyCore installation, which includes a root user with no password! omg! That’s a good thing. It means it’s still possible to fix a bricked appliance.
I know, it seems quite insecure at first glance, but the reality of a Linux virtual appliance is that anyone with access to the host machine can get root. There’s no way to prevent that.
Other techniques include mounting the disk(s) in another appliance, or booting from a recovery ISO/CD.
In the end, there are so many ways to obtain root access and to recover from such issues that there’s no real point in preventing your customers from having it. Security through obscurity is a wasted effort.
Moving forward
To avoid bricking customer appliances, we’ve decoupled essential services from the boot process. They will start no matter what, and always provide remote administration capabilities.
Secondly, the design of Jidoteki appliances makes it so easy to obtain root access (from the console, obviously not over the network) that we don’t need to worry about the consequences of a “bricked” appliance. The customer data is never touched by the updates, and customers are free to obtain access and perform a manual recovery procedure.
We’re working on an automated process (an integration test, essentially) which validates an appliance once it’s built against a set of criteria (e.g. does X service start, are disks mounted correctly, etc). We’ve already written the test suite and have been using it for a while on newly built appliances (not updated ones). Our last step is to integrate it into Jidometa and automatically run it against the builds/updates when they complete.
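As a rough illustration of that kind of criteria check, here is a minimal Python sketch. It is not our test suite: the function names, the service names (`sshd`, `nginx`, `jidoteki-api`), and the sample snapshot are assumptions. It takes the contents of `/proc/mounts` and a set of running service names, and reports anything that is missing.

```python
def mounted_points(proc_mounts_text):
    """Extract mount points from text in /proc/mounts format."""
    points = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])  # second field is the mount point
    return points


def validate_appliance(proc_mounts_text, running_services,
                       required_mounts, required_services):
    """Return a list of failures; an empty list means the build passes."""
    failures = []
    mounted = mounted_points(proc_mounts_text)
    for mount in required_mounts:
        if mount not in mounted:
            failures.append(f"not mounted: {mount}")
    for service in required_services:
        if service not in running_services:
            failures.append(f"not running: {service}")
    return failures


# Hypothetical snapshot of a freshly updated appliance where /data
# failed to mount and the API service never started.
SNAPSHOT_MOUNTS = """\
rootfs / rootfs rw 0 0
/dev/sda1 /boot ext4 rw 0 0
"""
SNAPSHOT_SERVICES = {"sshd", "nginx"}

failures = validate_appliance(
    SNAPSHOT_MOUNTS,
    SNAPSHOT_SERVICES,
    required_mounts=["/", "/data"],
    required_services=["sshd", "nginx", "jidoteki-api"],
)
print(failures)
```

Keeping the checks as plain functions over a snapshot means the same criteria can run against a new build and an updated older appliance alike.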
Finally, wouldn’t it be nice for an appliance to self-heal? Yes, it would, and we’re working on just that. The idea isn’t a new one (I implemented something similar in 2009 while working on a custom Linux OS). Essentially, rather than overwriting the existing OS during an update, we could rename the file and add a second “recovery” boot option which boots from the previous, known-working version when the primary one fails.
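The rename-instead-of-overwrite idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not how Jidoteki appliances implement it: the `.recovery` suffix, the function names, and the `primary_ok` flag (standing in for whatever health check the bootloader or init would perform) are all hypothetical.

```python
import os
import tempfile


def install_update(rootfs_path, new_image_bytes):
    """Install a new OS image, keeping the previous one as a recovery copy."""
    recovery_path = rootfs_path + ".recovery"
    tmp_path = rootfs_path + ".tmp"
    # Write the new image next to the old one first, so a failed write
    # never touches the currently working image.
    with open(tmp_path, "wb") as f:
        f.write(new_image_bytes)
    if os.path.exists(rootfs_path):
        os.replace(rootfs_path, recovery_path)  # rename, don't overwrite
    os.replace(tmp_path, rootfs_path)
    return recovery_path


def choose_boot_image(rootfs_path, primary_ok):
    """Pick the image to boot: primary if healthy, else the recovery copy."""
    recovery_path = rootfs_path + ".recovery"
    if not primary_ok and os.path.exists(recovery_path):
        return recovery_path
    return rootfs_path


# Demonstration in a throwaway directory with placeholder image contents.
workdir = tempfile.mkdtemp()
rootfs = os.path.join(workdir, "rootfs")
with open(rootfs, "wb") as f:
    f.write(b"old image")

install_update(rootfs, b"new image")

with open(rootfs, "rb") as f:
    primary = f.read()
with open(rootfs + ".recovery", "rb") as f:
    recovery = f.read()

fallback = choose_boot_image(rootfs, primary_ok=False)
print(primary, recovery, os.path.basename(fallback))
```

The cost is the disk space for one extra image, which seems a fair trade for an appliance that can boot its way out of a bad update.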
Contact us
As usual, if you’re planning on providing your On-Prem or Installer-based application to enterprise customers, contact us so we can discuss the details of your setup.