
The Great OCW.Social Migration

You might be wondering... Wtf is OCW.Social? It's the premier private Mastodon server for the cool kids from OneClickWifi, a late-2000s Nintendo Wi-Fi Connection forum.

I've been running and maintaining the server since November 2022. For the longest time it was hosted on Azure, but as of recently it's hosted on a hybrid on-prem/cloud setup. It's been a doozy.

Why migrate off of Azure?

So why did I make the decision to migrate off of Azure? Well... Long story short: I can't afford it anymore.

Mastodon is a beast to run, but not on the CPU side of things. You can start off with a small number of virtual CPUs (vCPUs) and a modest amount of memory, but, once you start federating with many other servers that use ActivityPub, that will not be enough. Is it processing power that can't keep up? Nope. It's memory. You need a lot of memory. Like a lot of memory.

Mastodon: Gobbler of RAM

You will have to segment out the different processes if you want a good experience with performance and uptime. You've got the Sidekiq queues that process all of the background jobs: incoming posts, outgoing posts, media conversion, link crawling, scheduled maintenance jobs, and more. You can definitely run them all in one process, but it will eventually get backlogged and cause a poor user experience. So the three main queues (default, push, and pull) essentially need to be spread out across multiple processes in different orders. Each of those queue processes needs, at a minimum, 512 MB of memory allocated to it. Then there's the frontend web app, which needs, at a minimum, 700 MB of memory allocated to it. They will all eventually run out of memory and need to be restarted.
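
As a rough sketch (illustrative, not my exact setup), spreading the queues out means running several Sidekiq processes that each work every queue but lead with a different one, so no single queue can starve the others:

  # Each process handles all three queues, just in a different priority order.
  # A concurrency of 25 is a common Mastodon default, not a magic number.
  bundle exec sidekiq -c 25 -q default -q push -q pull
  bundle exec sidekiq -c 25 -q push -q pull -q default
  bundle exec sidekiq -c 25 -q pull -q default -q push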

That presents a problem with keeping the frontend of Mastodon up without interruption. The backend processes/background jobs can restart all they want, but the frontend going down is a massive pain in the ass for everything (not just the users).

I strive to keep services up and running without interruption and without slowdowns. When either happens, I feel bad. Not because others will really care, but rather because I don't like it happening.

So why no more Azure?

That brings us back to the original question. Why migrate off of Azure? It's the same answer, but there's more context: running Mastodon in Azure required a lot of compute resources to satisfy Mastodon's craving for memory, and cloud providers scale VM sizes up in vCPU count and memory capacity together. Oh! And cost. Can't forget about cost.

That left me in a conundrum: I was paying for processing power that went completely unused, yet I was struggling to keep costs down because I needed the memory, and I was still scraping by on the amount of memory I could use. Of course, there were also the other meters I was being charged for, like load balancer traffic, storage, and whatnot.

For a while I could afford it, but maaaaan... The cost of living has gone up. Was it a bad idea to run Mastodon in a Kubernetes cluster in Azure? Not at all. I saved more by cramming everything into a Kubernetes cluster than I would have by hosting it directly on VMs. Not just money, but sanity too, since it meant I didn't have to maintain the underlying operating systems.

What do it be now?

Like I said, our Mastodon server now runs on hybrid on-prem/cloud infrastructure: some of it hosted on my own network and some of it hosted in Linode. The costs are much lower. Currently there's only one on-prem server, but I intend to expand that.

What hardware am I using on-prem though? A Raspberry Pi 5 with 8 GB of RAM. Its quad-core ARM Cortex-A76 processor gives it plenty of compute power, it has minimal power draw, it makes minimal noise, and it's cheap. In the cloud? I've got three VMs with varying compute and memory.

The Raspberry Pi 5 I'm using.

It's still a Kubernetes cluster too. That was always going to be a given. I'm utilizing the excellent k3s distribution. The three cloud VMs are the control plane nodes running embedded etcd, with the on-prem servers being agents in the cluster. That allows for high availability and lets the various processes be spread across both the cloud and on-prem.
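
The topology boils down to something like this; a rough sketch using the k3s install script, where the hostnames and token are placeholders rather than my actual values:

  # First cloud VM: start a server and initialize the embedded etcd cluster.
  curl -sfL https://get.k3s.io | sh -s - server --cluster-init

  # The other two cloud VMs: join as additional control plane nodes.
  curl -sfL https://get.k3s.io | sh -s - server --server https://<first-vm>:6443 --token <cluster-token>

  # The on-prem Raspberry Pi: join as an agent only.
  curl -sfL https://get.k3s.io | sh -s - agent --server https://<first-vm>:6443 --token <cluster-token>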

I'm currently connecting all of these together with Cloudflare WARP's private network capability. I don't intend to make that the long-term solution, and I'm working on creating a private network with just WireGuard. That being said, WARP 100% gets the job done, and latency between the nodes is within the 30-40 ms range. I'm also connecting all of the public-facing sites with Cloudflare Tunnels, so I'm not directly exposing my home network to the internet.
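
For the curious, the eventual WireGuard link between two nodes would be roughly this shape. The keys and addresses below are placeholders, not anything from my actual network:

  # /etc/wireguard/wg0.conf on one node (bring it up with: wg-quick up wg0)
  [Interface]
  Address = 10.10.0.1/24
  PrivateKey = <this-node-private-key>
  ListenPort = 51820

  [Peer]
  PublicKey = <peer-public-key>
  AllowedIPs = 10.10.0.2/32
  Endpoint = <peer-public-ip>:51820
  PersistentKeepalive = 25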

How did the migration go?

Initially the migration went fine. I changed the DNS for the Mastodon server to point to a maintenance page so that nothing would interact with the database. Then I migrated the database over to the new infrastructure (over 30 GB of data to transfer) and started provisioning the Mastodon containers.
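
If you're planning a similar move, the boring-but-reliable way to shift a Postgres database that size is a dump and restore, something along these lines (the hostnames are placeholders, not my actual servers):

  # Dump from the old primary in Postgres's compressed custom format...
  pg_dump -h old-db.example.com -U mastodon -Fc -d mastodon_production -f mastodon.dump

  # ...then create and load the database on the new infrastructure.
  createdb -h new-db.example.com -U mastodon mastodon_production
  pg_restore -h new-db.example.com -U mastodon -d mastodon_production --no-owner mastodon.dump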

It seemed to be fine for a bit, but some pretty nasty issues popped up. There was the typical stuff, like temporarily scaling up the Sidekiq queues to play catch-up with the servers we federate with, but things like that pale in comparison to two issues that took a while to resolve.
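
In Kubernetes that kind of catch-up is just a quick scale-out and scale-back; the deployment name here is illustrative, not my actual one:

  # Add Sidekiq replicas to chew through the federation backlog...
  kubectl scale deployment mastodon-sidekiq --replicas=4
  # ...then drop back down once the queues are clear.
  kubectl scale deployment mastodon-sidekiq --replicas=1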

Database corruption

The first major issue was database corruption. I can't remember all of the details about how it got corrupted in the first place, but it was a mess to fix. The morning after I turned the cluster into a highly available one, I woke up to all of the alerts that had fired while I was asleep telling me the server was down. I quickly started looking at the logs and saw that the volume for the database had become corrupted. And it wasn't just the primary database; the replicas were corrupted too.

Ah! No biggie! I was doing volume snapshots and backing those up, so it should be fairly easy to recover. Right? Right!?! Yeah, that's going to be a hard no. Every one of the volume snapshots I had was corrupted too. I was legit panicking.

The server was down for the majority of the day while I frantically tried every method of repairing/recovering a Postgres database that I knew of. None of them worked. I was almost at the point of calling it a loss and starting fresh, but that would have been devastating to both the content we've posted and our federation state with other servers.

Wanna know how I was able to recover it? In a Hail Mary attempt, I created a new database and forcefully copied the data from the failed database volume into the new one. That actually fixed it. I shit you not, that got the database into a functional state again. I brute forced my way into recovering the database.

The message I sent after I got the database working again.

That's on me. I should have known better: volume snapshots alone aren't the best way to back up a Postgres database. I have a two-tier backup approach now:

  1. Continuous backup of the WAL files.
  2. A weekly volume snapshot.

Restoring from the WAL backups is the first option, but, if shit really hits the fan, restoring the volume snapshot and replaying the WAL backups on top of it is the second. I should have done this from the beginning, but I didn't. This current setup should make disaster recovery much easier.
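
The WAL side of that is just Postgres's built-in continuous archiving. Conceptually it boils down to the settings below; the cp destination is a placeholder, and in practice a tool like WAL-G or pgBackRest that ships segments off to object storage is a better idea than a bare cp:

  # postgresql.conf: copy every completed WAL segment somewhere off the database volume.
  wal_level = replica
  archive_mode = on
  archive_command = 'cp %p /mnt/wal-archive/%f'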

Slow network traffic

The second major issue was with network traffic in the container network being stupid slow. Mainly between the cloud nodes and the on-prem node.

When I first set everything up, I was getting network speeds that matched what I would expect: 200-300 Mb/s.1 Then... It just dropped. Dramatically. We're talking about it going down to only 1-15 Mb/s. It was bad, and it made trying to use Mastodon a real pain. The problem affected all nodes in the cluster, but it was more bearable between the cloud nodes. I essentially had to stop scheduling the frontend and database pods on the on-prem node and relegate it to only the Sidekiq queues.

That was only a band-aid fix though, because, like I said, it affected all of the nodes. So I had to do a lot of troubleshooting. It took me a week or two to pinpoint the specific problem.

Direct node-to-node traffic outside of the container network was fine, and even node-to-container traffic was fine; however, container-to-container traffic was problematic. It wasn't just slow: network packets were being dropped, which was a big part of why it was so slow. I was able to figure that out by deploying an iperf3 container onto all of the nodes and running tests between them in different configurations (node-to-node, node-to-container, container-to-node, and container-to-container).
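
If you want to run the same kind of test, throwaway pods make it quick. The image below is just a commonly used community iperf3 image (its entrypoint is iperf3), and you'd pin the pods to specific nodes, e.g. with a nodeSelector, to test each node pair:

  # Start an iperf3 server pod and grab its IP.
  kubectl run iperf3-server --image=networkstatic/iperf3 -- -s
  kubectl get pod iperf3-server -o jsonpath='{.status.podIP}'

  # Run a client pod against it from another node (substitute the IP from above).
  kubectl run iperf3-client --rm -it --image=networkstatic/iperf3 -- -c <server-pod-ip>

  # For the node-to-node and node-to-container cases, run iperf3 directly on the hosts instead.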

So what ended up being the issue? There were two:

  1. There was an MTU mismatch between the container network and the Cloudflare WARP interface. The virtual network interface for WARP has an MTU of 1280 bytes, but everything else typically had an MTU of 1500 or 1420 bytes. This was causing packets to be dropped. I had to force k3s to bind to a specific address, with --bind-address, to get it to utilize the correct MTU.2
  2. There was also how TCP congestion was being handled. By default, the vast majority of Linux kernel configs ship with CUBIC as the default TCP congestion control algorithm. Switching to BBR (Bottleneck Bandwidth and Round-trip propagation time) helped alleviate that. Pretty drastically too. Considering BBR was designed to keep throughput up on paths with variable latency and some packet loss, it makes sense that it would work much better with our network setup. (A rough sketch of both fixes follows this list.)
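
Neither fix amounts to much. Roughly, with 100.96.0.2 standing in as a placeholder for a node's WARP address (and the sysctl values going into /etc/sysctl.d/ to persist across reboots):

  # k3s: bind the server to the node's WARP address so traffic stays on the
  # 1280-byte-MTU interface (this is the --bind-address from footnote 2).
  k3s server --bind-address 100.96.0.2

  # Linux: switch TCP congestion control from CUBIC to BBR; fq is the
  # commonly recommended companion qdisc.
  modprobe tcp_bbr
  sysctl -w net.core.default_qdisc=fq
  sysctl -w net.ipv4.tcp_congestion_control=bbr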

Making both of those changes fixed the networking issues I was seeing. I haven't seen any problems with it since.

How's it going post-migration?

Really well! The initial problems are gone and it's been rather smooth sailing. There have been like one or two minor issues since then, but nothing major. In fact, the hybrid approach I've chosen has worked extremely well. My ISP had an outage late one night, but we only had a few minutes of downtime while all of the containers that were on the on-prem node were spun up on the other nodes. So it's working really well!


  1. You might be thinking, shouldn't that be higher? Not really. There's going to be network performance degradation in a container network, so it's to be expected.

  2. Fun fact! I haven't been able to apply updates to k3s because a bug introduced in v1.30.1+k3s1 causes the kubelet to not work properly when the --bind-address argument is provided.