According to our customers, the most common challenges that technology companies face include system failures, the time it takes to restore services, security issues, and an outdated technology stack. Most of these issues point to problems in the infrastructure, the application, and/or the deployment process.


In this article we will look at each of these factors through the lens of setup and optimization and suggest ways to reduce the likelihood of failure. We will also underline the importance of observability and its role in maintaining your systems’ stability and health.


Effective architecture design

When designing architecture, a good rule of thumb is to start from your applications’ performance profiles and the way they align with your business goals. Whatever type of architecture you choose, it should provide reliable and consistent performance.
Focusing on the questions below will help your team build an effective strategy:


  • Gather as much information on your existing workloads as you can. Where do your applications currently run, and who uses them? Are there any bottlenecks in compute performance, memory, or networking? What can be done to fix them?
  • Are your systems capable of handling production workloads while meeting your SLA? Perform capacity planning and workload estimation to find out.
  • Follow best practices of fault tolerance and reliability when architecting your systems. Make sure all critical components are redundant, so that standby ones can take over if active ones go down. In a cloud setup, spread your solution across several zones for better availability. Design your systems to be scalable, so that they can add and remove resources in response to changes in demand.
  • If you are in a cloud, make sure that the suite of services closely matches your operational needs. Choose the types of services that best suit your application logic and optimize their configuration to achieve the highest performance of all components.
  • How reliable is your infrastructure? Do you have strategies in place to quickly restore your infrastructure or system components after a failure? Always automate and test your failure management procedures to prove they are working.
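The capacity planning step mentioned in the list above can be approximated with simple arithmetic. The following is a minimal sketch rather than a sizing method: the function name `required_instances`, the 70% target utilization, and the single spare instance are illustrative assumptions, and real estimates should come from load tests of your own workloads.

```python
import math


def required_instances(peak_rps: float,
                       rps_per_instance: float,
                       target_utilization: float = 0.7,
                       spare: int = 1) -> int:
    """Estimate how many instances are needed to serve peak load.

    peak_rps           -- expected peak requests per second (assumed input)
    rps_per_instance   -- throughput one instance sustains in load tests
    target_utilization -- fraction of capacity to use at peak (headroom)
    spare              -- extra instances kept for redundancy
    """
    if peak_rps < 0 or rps_per_instance <= 0:
        raise ValueError("load and per-instance throughput must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Each instance is only allowed to run at the target utilization.
    usable_rps = rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps) + spare
```

For example, `required_instances(1000, 100, 0.5)` returns 21: twenty instances running at half capacity cover the peak, plus one spare.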


Stack modernity

Nowadays, as more products become digital, being technologically advanced is a must. First, you can take advantage of the latest features available on the market and broaden your technical capabilities. Second, an up-to-date stack makes automation easier, allowing for a faster development cycle, fewer bugs, and smoother deployments. Last but not least, it helps you organize processes in a more efficient way, bringing more value.

Automation of processes

Automating routine work is an effective way to save time and free up human resources for more creative tasks. It helps to accelerate the development process, improves code quality, simplifies infrastructure management, and overall, boosts business processes. The main areas of infrastructure automation include:


  • Infrastructure as Code (IaC). Defining infrastructure with all its constituent resources in code lets you quickly provision new environments from predefined templates, easily update them, and reproduce them as many times as needed.
  • Continuous Integration and Continuous Delivery (CI/CD). Code delivery pipelines allow you to automate most deployment tasks, from typical build, test, and deploy steps to more complex ones such as deploying to selected environments or performing rolling updates.
  • Autoscaling. Automated resource allocation in a cloud lets you satisfy demand while keeping a balance between over- and under-provisioning.
  • Automated monitoring. Advanced monitoring systems detect when metrics cross threshold values and automatically take the desired action, for example, restarting a service, adding or removing resources, or notifying an on-call engineer.
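To make the autoscaling item above more concrete, here is a hedged sketch of a scaling decision. It is modelled on the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)); the function name, the replica limits, and the 10% tolerance default are assumptions chosen for illustration, not a reproduction of any autoscaler's full behavior.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Return the replica count that brings the metric back to its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: avoid scaling back and forth.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, desired))
```

With four replicas at 90% average CPU and a 60% target, the function asks for six replicas; at 62% it leaves the count unchanged because the deviation is within the tolerance.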



A security strategy is a set of measures aimed at protecting your data, systems, and assets against attacks and potential threats. Architect your systems according to best practices in identity and access management, infrastructure protection, detective controls, and data protection. Follow a defense-in-depth approach to apply security at all infrastructure layers. Remember: your company’s security posture is only as strong and effective as the individual practices that contribute to it.



Code quality

Needless to say, poor code quality may well result in incidents, rollbacks, and production issues. To measure the quality of deployed software, DevOps teams use the change failure rate, one of the DORA metrics, which indicates how often changes cause failures. The change failure rate captures the percentage of code changes that led to rollbacks or any type of production failure; the lower it is, the fewer defective changes reach production.
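As a rough illustration of the metric, the sketch below computes the change failure rate from a list of deployment records. The `Deployment` record and its `failed` field are hypothetical; in practice this data would come from your CI/CD or incident tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Deployment:
    version: str
    failed: bool  # True if the change led to a rollback or production failure


def change_failure_rate(deployments: Sequence[Deployment]) -> float:
    """Percentage of deployments that caused a failure in production."""
    if not deployments:
        return 0.0  # no changes shipped, so nothing failed
    failures = sum(1 for d in deployments if d.failed)
    return 100.0 * failures / len(deployments)
```

Four deployments with one rollback give a change failure rate of 25.0.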


There are several ways to reduce the change failure rate and improve code quality. For example, consider introducing test-driven development (TDD) – an Agile practice in which you write unit tests for each program component before the code itself and run them to check whether the component works correctly. If a test fails, you rework the code until it meets the acceptance criteria. Aim to cover as much of your code’s functionality with tests as possible, ideally no less than 80% of the existing codebase.
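A small sketch of the test-first workflow: the tests below describe the expected behavior, and the function is then written to satisfy them. The `slugify` helper is a made-up example, and the tests are plain `assert` functions that a runner such as pytest could collect.

```python
import re


def slugify(title: str) -> str:
    """Turn a title into a lowercase, hyphen-separated URL slug."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


# In TDD these tests are written first and fail until slugify() exists.
def test_lowercases_and_joins_words():
    assert slugify("Hello World") == "hello-world"


def test_drops_punctuation_and_extra_spaces():
    assert slugify("  DevOps:   factors of success! ") == "devops-factors-of-success"


def test_empty_title_gives_empty_slug():
    assert slugify("") == ""
```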


Use code review to improve your code quality. Pull requests are an easy way to do this: a team leader or peer reviews all proposed code changes before they are merged, thereby minimizing the chances of erroneous code reaching critical environments.


Also, try using feature flags – a DevOps technique that allows switching features on and off at runtime. If a flag is on, the new code is executed; if the flag is off, the code is skipped. The technique enables gradual feature rollouts and easier bug fixes.
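The idea can be sketched in a few lines. This example assumes flags are read from environment variables named `FEATURE_<NAME>` and that gradual rollout is done by hashing a user ID into a percentage bucket; both conventions, and the `bulk_discount` flag itself, are illustrative choices rather than a standard.

```python
import hashlib
import os
from typing import Mapping, Sequence


def flag_enabled(name: str, default: bool = False,
                 env: Mapping[str, str] = os.environ) -> bool:
    """Read a boolean flag such as FEATURE_BULK_DISCOUNT=on."""
    raw = env.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "on", "yes"}


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


def checkout_total(prices: Sequence[float],
                   env: Mapping[str, str] = os.environ) -> float:
    total = sum(prices)
    # New code path runs only while the flag is on; otherwise it is skipped.
    if flag_enabled("bulk_discount", env=env) and len(prices) >= 3:
        total *= 0.9
    return round(total, 2)
```

Turning the flag off restores the old behavior without a redeploy, which is what makes rollbacks of a faulty feature quick.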


Cloud readiness

As most software is now delivered as a service, compliance with cloud standards becomes a key factor. First and foremost, this applies to an application’s design: how suitable it is for deployment on modern cloud platforms and how easily it can be containerized.


Applications built as cloud native are initially designed to leverage the advantages of cloud services, including microservices architecture, maximum portability between runtimes, and scaling.


To ensure your application is cloud-ready, evaluate it according to the 12 Factor App principles – a methodology for building SaaS applications. Although they were published before containers and Kubernetes became mainstream, the 12 Factor App principles are still relevant today and synthesize best practices for building scalable, portable, and cloud-ready applications.
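One of the twelve factors is storing configuration in the environment rather than in code. The sketch below shows what that might look like; the `Settings` class and the variable names `DATABASE_URL`, `PORT`, and `DEBUG` are assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    port: int
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables only (no config files)."""
    return Settings(
        # Required value: a missing variable raises KeyError at startup.
        database_url=env["DATABASE_URL"],
        # Optional values fall back to defaults suitable for local runs.
        port=int(env.get("PORT", "8080")),
        debug=env.get("DEBUG", "false").strip().lower() == "true",
    )
```

Because nothing environment-specific is baked into the build, the same image can be promoted unchanged from staging to production.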



It is common for developers to consider security as an afterthought, when it should be an indispensable part of the process starting from early stages of development. Security by design leverages this approach and focuses on making software as secure as possible during the development process.


Many cyberattacks are made possible by exploiting software vulnerabilities – programming mistakes or oversights that leave applications, servers, and websites exposed to threats. Adhering to security by design principles – a list of security standards developed by the Open Web Application Security Project (OWASP) – helps developers build applications with high security levels and minimize the chance of successful cyberattacks.




Deployments and code delivery

Software deployments vary across different teams, organizations, and services. Whether we mean it to or not, a release can turn into a very challenging process with unpredictable results. Creating a foundation for better software delivery will help make your deployments more reliable and limit the impact of failures. Consider implementing the following practices:


  • CI/CD. The continuous methodologies of software development imply triggering automated mechanisms that build, test, and deploy code each time a new change is submitted to a codebase. Since every stage of the CI/CD process is covered with tests, potential issues are filtered out early, and the pipeline acts as a quality gate that blocks erroneous code. Well-established code delivery pipelines support faster iterations, boosting the efficiency of your teams.
  • Iterate in shorter cycles. Working with smaller changes means that your code is more readable and, as such, easier to test and troubleshoot. Releasing in smaller batches is also more efficient in that developers receive feedback faster and fix identified issues sooner.
  • Deployment strategy. Choose the strategy that works best for your application, or combine strategies based on each service’s needs, its performance profile, and possible business impact. For example, a basic deployment would suffice for regular applications but not for critical ones, where an interruption of service can affect business operations. In that case, the less risky canary or blue-green strategies would be preferable.
  • Automated rollbacks. To a large extent, this depends on the efficiency of your CI/CD pipelines and chosen deployment strategy. Provided that all the steps are automated, rollbacks can minimize the impact of a failed deployment or a broken version, making the release process less stressful.
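The canary strategy and automated rollback from the list above can be combined into one control loop: shift traffic in steps and roll back as soon as the new version misbehaves. This is a simplified sketch; `observe_error_rate` is a hypothetical callback standing in for a query to your monitoring system, and the 1% threshold is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RolloutResult:
    status: str                # "promoted" or "rolled_back"
    last_traffic_percent: int  # traffic share at which the rollout ended


def run_canary(steps: Sequence[int],
               observe_error_rate: Callable[[int], float],
               max_error_rate: float = 0.01) -> RolloutResult:
    """Increase canary traffic step by step, rolling back on high error rates."""
    if not steps:
        raise ValueError("at least one traffic step is required")
    for percent in steps:
        # In a real pipeline this is where traffic is shifted and metrics
        # are collected for a while before the check below.
        if observe_error_rate(percent) > max_error_rate:
            return RolloutResult("rolled_back", percent)
    return RolloutResult("promoted", steps[-1])
```

A call such as `run_canary([5, 25, 100], check)` promotes the release only if every step stays under the threshold, so a broken version never receives full traffic.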



To prevent incidents from happening, we have to be watchful of the signals that our systems send out. Observability delivers insight into the state of our systems at any given moment. It does so by bringing health and performance metrics, distributed traces, and logs together into a single picture.


There are two types of metrics that we want to monitor:


  • infrastructure-specific metrics that relate to the operational layer of our systems, for example, server resource usage, database utilization, or deployment failure rate
  • application-specific metrics that deliver information on a given service and help to assess its business logic, for example, application availability (SLA), error rate, and database query time.


Some parameters, such as RAM or CPU usage, may refer to both infrastructure and application. However, they describe different things: as an infra-specific metric, the usage of resources by a given machine (physical or virtual), and as an app-specific metric, the consumption of RAM or CPU resources by an application that runs on that machine. We need both to obtain a panoramic view of our systems’ behavior.



Tracing is an indispensable tool for microservices architectures. It allows us to gather information directly from the inside of services and bring it together with everything else we know about the system. Such an approach delivers the necessary information for developers to detect slow calls, long-running operations, failing services, and where they happen.
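To show what a trace records, here is a toy in-process tracer. It is an illustration only: the `Tracer` class and its span fields are invented for this sketch, and production systems would normally rely on a dedicated tracing library that also propagates context between services.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects spans: named, timed operations linked to a parent span."""

    def __init__(self):
        self.spans = []    # finished spans, in completion order
        self._stack = []   # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        parent = self._stack[-1]["id"] if self._stack else None
        span = {"id": uuid.uuid4().hex[:8], "name": name,
                "parent": parent, "error": None}
        self._stack.append(span)
        start = time.perf_counter()
        try:
            yield span
        except Exception as exc:
            span["error"] = repr(exc)  # mark the failing operation
            raise
        finally:
            span["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(span)


def slow_spans(tracer, threshold_ms):
    """Names of spans that took at least threshold_ms milliseconds."""
    return [s["name"] for s in tracer.spans if s["duration_ms"] >= threshold_ms]
```

Wrapping a request handler in `tracer.span("request")` and each downstream call in its own nested span yields a parent-child tree from which slow or failing calls can be located.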


Unlike monitoring metrics, which record regular performance data, logs record discrete events, both routine and exceptional, such as access log entries for a web service or various error conditions.


Analyzing logs enables tech teams to pinpoint a problem and take actions to prevent it from happening in the future, thus allowing for efficient debugging and optimization of the performance of infrastructure and services.
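Logs are easier to analyze when they are emitted in a structured form. The sketch below uses Python's standard `logging` module with a small JSON formatter and a helper that counts errors per logger; the `JsonFormatter` class, the field names, and the `error_counts` helper are assumptions made for the example.

```python
import json
import logging
from collections import Counter
from typing import Iterable


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def error_counts(lines: Iterable[str]) -> Counter:
    """Count ERROR entries per logger in JSON-formatted log lines."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        if entry["level"] == "ERROR":
            counts[entry["logger"]] += 1
    return counts
```

Attaching the formatter to a handler with `handler.setFormatter(JsonFormatter())` makes every service log in the same shape, so a spike in errors from one component stands out quickly.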



In this article we have outlined the main reasons why your infrastructure, application, and/or deployment could fail, and provided general recommendations on how to tune them for better performance. You can check how well your systems are prepared for production challenges by trying out our real-time infrastructure audit. Or, to obtain a detailed expert overview, book a call so that we can discuss your issue in depth.


SHALB provides DevOps, SRE, and System Architect Services for software teams throughout the world with 24×7 support. We build and maintain fault-tolerant cloud native systems using the Infrastructure as Code approach, Kubernetes, Serverless, and Terraform technologies.