Upgrading dockerized applications in Kubernetes might seem like a straightforward task for a seasoned engineer. However, what appeared to be a simple upgrade of the Cassandra database turned into a tortuous, months-long endeavor. Read on to learn about our challenging experience and the lessons we learned from it.

 

Backstory

A client approached us with the task of upgrading their Kubernetes cluster, which hosted their cloud software. This was a private GKE cluster, and Google was about to stop supporting its specific version. Failing to upgrade could result in system failure, leading to downtime and potential data loss—a devastating blow to the client’s reputation.

 

The customer’s business focuses on organizing and managing parking lots in major cities. They develop and rent out a SaaS platform that enables easy startup and management of city parking businesses worldwide.

 

Considering the international scope of their business and plans for future growth, the customer chose Cassandra as the primary database for their application. As a distributed database, Cassandra replicates data across multiple nodes located in different regions, and with locally scoped consistency levels, network latency between regions has little impact on performance. To enhance Cassandra's scalability and availability, it was deployed within a Kubernetes cluster.
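
To make the idea concrete, here is a minimal sketch of multi-region replication using the Python cassandra-driver. The keyspace and datacenter names are invented for illustration and are not the client's actual topology; the point is that with NetworkTopologyStrategy and a local consistency level, reads and writes are acknowledged within the local datacenter, keeping cross-region latency off the critical path.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Contact point, keyspace, and datacenter names are illustrative placeholders.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)
cluster = Cluster(["10.0.0.10"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# Keep three replicas in each datacenter (e.g. one datacenter per region).
# With LOCAL_QUORUM, requests are acknowledged by the local datacenter only,
# so cross-region replication happens asynchronously in the background.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS parking
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'europe-west1': 3,
        'us-east1': 3
    }
""")
```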

 

Unprofessional approach to infrastructure building

The biggest problem with the customer’s existing solution was its inconsistency and non-compliance with best infrastructure-building practices. Without an in-house DevOps team, the client hired various freelancers to configure their infrastructure components. Consequently, their systems were fragmented and difficult to manage. In hopes of improving their setup, they reached out to SHALB for assistance.

 

The first issue we encountered was that all applications, both stateless and stateful, were running in a single namespace. This bad practice complicates cluster management in general, and it hampered our work from the start. When we tried to clone the Kubernetes cluster for testing, the process failed due to an error in one of the applications. If the applications had been running in separate namespaces, we could have easily resolved this by disabling the problematic namespace and configuring the troublesome application manually after the cluster was initialized. Furthermore, this architectural decision compromised cluster restoration, since Google restores clusters from backup namespace by namespace.
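
A rough sketch of how this can be audited and then fixed with the official Kubernetes Python client is shown below; the namespace names are invented for illustration and are not the client's actual services.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
apps = client.AppsV1Api()
core = client.CoreV1Api()

# Quick audit: if every Deployment reports the same namespace, workloads are
# not isolated and cannot be disabled or restored independently.
for deploy in apps.list_deployment_for_all_namespaces().items:
    print(f"{deploy.metadata.namespace}/{deploy.metadata.name}")

# Carve out a dedicated namespace per application (names are placeholders).
for ns in ("cassandra", "elasticsearch", "cert-manager"):
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=ns))
    )
```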

 

Although the company positioned itself as an international business, its infrastructure was only partially codified. Whenever they wanted to launch a new environment, they had to reproduce stack configurations manually, which required a lot of effort. Additionally, their Kubernetes cluster lacked autoscaling and was constantly operating at full capacity. By our estimates, implementing even basic autoscaling could have reduced their cloud costs by up to 60 percent.
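
As an illustration, enabling node-pool autoscaling on a GKE cluster comes down to a single gcloud command; the sketch below wraps it in Python for consistency with the other snippets, and every name and number in it is a placeholder rather than the client's real setup.

```python
import subprocess

# Placeholder cluster, node pool, region, and bounds: let GKE add nodes under
# load and remove them when the cluster is idle instead of running at a fixed
# full-capacity size around the clock.
subprocess.run([
    "gcloud", "container", "clusters", "update", "parking-cluster",
    "--node-pool", "default-pool",
    "--enable-autoscaling",
    "--min-nodes", "3",
    "--max-nodes", "10",
    "--region", "europe-west1",
], check=True)
```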

 

We prepared and presented the client with a comprehensive infrastructure improvement plan, the first step of which was to describe the infrastructure as Terraform code. Codifying the infrastructure is essential for process automation: it ensures minimal recovery time after failovers, simplifies scaling to multiple regions, and optimizes cloud costs by providing transparency into the services running in the cloud. The roadmap also emphasized the importance of running applications in separate cluster namespaces and implementing autoscaling.

 

Kubernetes cluster upgrade

Upgrading the Kubernetes cluster required upgrading all the applications running within it, including ElasticSearch, Cert Manager, and Cassandra. These applications were deployed using Helm files and custom bash scripts embedded in a dedicated image sourced from a public GitLab repository. The problem arose with the Cassandra Helmfile: it was no longer supported, since the Cassandra developers had switched to an operator for deploying the application. Consequently, the deprecated Helmfile couldn't be used to deploy the K8ssandra Operator to a newer version of the Kubernetes cluster. Furthermore, our attempts to upgrade the image from the public repository caused the bash scripts that managed application deployment to disappear.
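
For context, the operator-based route recommended upstream boils down to installing the K8ssandra Operator with Helm. Below is a minimal sketch of that step, wrapped in Python to keep the snippets consistent; the repository URL, chart, and release names follow the public K8ssandra documentation and should be verified against the current docs.

```python
import subprocess

def sh(*args: str) -> None:
    """Run a CLI command and fail loudly if it returns a non-zero exit code."""
    subprocess.run(args, check=True)

# Add the K8ssandra Helm repository and install the operator into its own
# namespace (names follow the upstream docs and are assumptions here).
sh("helm", "repo", "add", "k8ssandra", "https://helm.k8ssandra.io/stable")
sh("helm", "repo", "update")
sh("helm", "install", "k8ssandra-operator", "k8ssandra/k8ssandra-operator",
   "--namespace", "k8ssandra-operator", "--create-namespace")
```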

 

As a result, we had to resort to upgrading each application manually instead of performing a comprehensive upgrade through code changes.

 

Upgrade of Cassandra clusters

The customer used the K8ssandra Operator to deploy Cassandra to Kubernetes. On one hand, operators streamline application installation and management on Kubernetes. On the other hand, they introduce another configuration layer that can potentially lead to errors, and that is exactly what happened in our case.

 

Issues started emerging already during the initialization of a new Cassandra cluster. As it turned out, they stemmed from the specifics of the private GKE cluster, a highly secure Kubernetes setup hosted on Google's infrastructure. Its enhanced security measures require configuring firewall rules even for traffic between applications running within the same cluster, and that was exactly our problem.

 

In the end, we managed to install the cluster with the K8ssandra Operator by opening the port used by the application and adding it to the cluster's firewall rules, an unconventional workaround suggested by an outside developer. Interestingly, we received no feedback from the Cassandra developers throughout this process.
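
The underlying cause is that in a private GKE cluster the control plane can, by default, reach the nodes only on TCP ports 443 and 10250, so an operator's admission webhook listening on any other port is blocked. Our workaround amounted to a firewall rule along the lines of the sketch below; every value is a placeholder, and 9443 is simply the common controller-runtime webhook port, not necessarily the one in your setup.

```python
import subprocess

# Placeholders: VPC network, the private control plane's CIDR block, the node
# network tag, and the webhook port all depend on the actual cluster.
subprocess.run([
    "gcloud", "compute", "firewall-rules", "create",
    "allow-gke-master-to-k8ssandra-webhook",
    "--network", "parking-vpc",
    "--allow", "tcp:9443",
    "--source-ranges", "172.16.0.0/28",      # control plane CIDR
    "--target-tags", "gke-parking-cluster-node",
], check=True)
```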

 

Data synchronization between clusters

One of Cassandra's built-in features is support for cluster synchronization: when a new cluster is connected, Cassandra treats it as a new datacenter that needs to be synchronized. Once synchronization is complete, you can switch off the old cluster and designate the new one as the main one. We successfully tested this approach in the staging environment. However, when we tried to replicate it in production, we hit a failure with an error message we couldn't find any information about, even after extensive searching on Google.
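
For reference, the approach we tested in staging follows the standard multi-datacenter migration flow, sketched below with the Python cassandra-driver; the keyspace and datacenter names are invented for illustration.

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.10"]).connect()

# 1. Tell each keyspace to keep replicas in the new datacenter as well.
session.execute("""
    ALTER KEYSPACE parking
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'old-dc': 3,
        'new-dc': 3
    }
""")

# 2. On every node of the new datacenter, stream the existing data from the
#    old one (run via SSH or kubectl exec, not through the driver):
#        nodetool rebuild -- old-dc
# 3. Once the new datacenter is in sync, point clients at it, drop 'old-dc'
#    from the replication settings, and decommission the old nodes.
```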

 

After browsing various forums, we stumbled upon a potential solution suggesting that we delete the CustomResourceDefinition (CRD). Upon reinstalling the cluster, the operator would generate a new CRD, which was supposed to resolve the issue.

 

Unfortunately, this proved unsuccessful. To make things worse, deleting a CRD also deletes every custom resource of that kind, so the production cluster went down and we lost the primary database on which the other clusters relied for synchronization.

 

Dysfunctional backups

The situation was certainly frustrating, but we were prepared for it. Sure enough, we had copied all the data before deleting the CRD, so we had at hand a Cassandra backup made with Medusa, a dedicated tool for Cassandra backup and restore. Initially, we were optimistic about using this backup. However, to our dismay, restoration from it failed due to some malfunction in Medusa: although Medusa reported a successful restoration, each attempt resulted in an empty database. What ultimately saved us were the Persistent Volumes: fortunately, they remained in Google Cloud after the failure of the old cluster, preserving all of its data.

 

In the end, we opted to create a new cluster, scaled down to just one node, and connected that node to one of the remaining volumes. Thankfully, it worked: the data was successfully restored, and just in time, as it was already morning and cars were starting to appear in the customer's parking lots.
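
For readers facing a similar recovery, the trick is to pre-bind a PersistentVolumeClaim to the surviving PersistentVolume so that the new single-node Cassandra pod starts on top of the old data. A rough sketch with the Kubernetes Python client follows; all names and sizes are placeholders, and note that the volume's reclaim policy must be Retain and any stale claimRef on it has to be cleared before it can be rebound.

```python
from kubernetes import client, config

config.load_kube_config()

# Placeholder PV name, size, storage class, and namespace: bind a new claim
# directly to the volume that survived the old cluster's failure.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="cassandra-data-restore"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        volume_name="old-cassandra-data-pv",   # the surviving PersistentVolume
        storage_class_name="standard",
        resources=client.V1ResourceRequirements(requests={"storage": "200Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim("cassandra", pvc)
```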

 

Conclusion 

Finally, considering all the issues outlined above, we opted for a more conventional and time-proven approach: deploying the Cassandra database outside of Kubernetes. In both staging and production environments, we provisioned virtual machines, installed Cassandra without using the operator, and successfully migrated all the data. This strategy proved effective, with Cassandra functioning smoothly and without any issues. For backups, we implemented a snapshot-based solution, which not only meets the customer’s requirements but also simplifies maintenance and enhances reliability.
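
To give a flavor of what a snapshot-based backup can look like on a VM, here is a minimal sketch: a nightly nodetool snapshot shipped to a bucket with gsutil. The keyspace, data path, and bucket names are placeholders, and an equivalent scheme based on cloud disk snapshots would work just as well.

```python
import glob
import subprocess
from datetime import datetime, timezone

tag = datetime.now(timezone.utc).strftime("backup-%Y%m%d")

# 1. Take an on-disk snapshot of the keyspace (hard links, so cheap and fast).
subprocess.run(["nodetool", "snapshot", "-t", tag, "parking"], check=True)

# 2. Ship each table's snapshot directory to a bucket for safekeeping.
for snap_dir in glob.glob(f"/var/lib/cassandra/data/parking/*/snapshots/{tag}"):
    table_dir = snap_dir.split("/")[-3]
    subprocess.run(
        ["gsutil", "-m", "rsync", "-r", snap_dir,
         f"gs://parking-cassandra-backups/{tag}/{table_dir}"],
        check=True,
    )
```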

 

Key lessons learned from this experience include:

 

  • Working with an infrastructure set up by others can be exceptionally challenging but feasible.
  • Unknown infrastructures may harbor hidden issues that can significantly prolong seemingly simple tasks, such as upgrading Kubernetes versions. It’s crucial for both you and your client to acknowledge this before starting a project.

 

You may wonder why we are telling you all this. We shared these insights to underscore SHALB's commitment to excellence and the proficiency of our team. Contact us to witness it firsthand! We build next-gen infrastructures according to best practices and, just as importantly, never let our customers down.