Learning to trust the machines and the process in DevOps

devops infrastructure

25 April, 2019

Chris Yates

VP of Marketing

“It worked in staging”

… says the dev team. And the ops team replies: “It must be a code issue.”

Developers and ops people sometimes have different perspectives on why a deployment went wrong. Was it the code? Or the infrastructure? In the end, it’s the user who suffers, and that user doesn’t care why your product was broken. They just want it to work, so they can do their work.

It really did work in staging

Everyone on the team wants to get things right the first time, deploying a quality product with every feature release. That's why development teams adhere to deployment best practices. They write tests. They write code. They test that code in staging environments that mimic the production environments that code will run in … as closely as possible.

Every member of the team approaches projects with not only best practices, but with best intentions. But, as the saying goes, we're all only human.

"Sometimes, humans miss things."
-- Computers

A slightly different network configuration here, an older version of a PHP extension or NPM module there, and suddenly the places where code is being tested before launching it to the world aren't quite perfectly in sync. And the code that ran fine in QA falls down in production.

It gets worse when teams hit bottlenecks, such as a limited set of environments or, worse, a single staging environment that all code has to pass through. Who hasn't pushed a 'hotfix' to resolve a critical problem in prod, skipping the usual testing process, because the QA or staging environment was in use by others and we couldn't wait?

It really was a code change

Or maybe it was a different version of the XYZ module after all? How do we know? Often, the management of code and of infrastructure configuration is siloed, sometimes driven and managed by separate teams.

Though software development has come a long way---test-driven development (TDD), rigorous code review processes, automated testing, and more---it remains rare to manage the whole system including infrastructure and data with the same tools and process.

As a result, it's difficult to tell at a glance which change to the system was the direct cause of a fault, and sometimes it's hard to know that the system state has changed at all. That's due, in large part, to having separate tools and processes to manage change in code versus infrastructure, dev versus ops.

Driving Dev/Ops alignment

Developers are measured by the features they ship. Ops is measured by uptime and performance. The most important person, the user, doesn't care who's at fault if the product/site/service/experience you're offering them falls short of their expectations.

How do we solve for the user's concern and deliver features they want, and reliability they expect?

Stop working in silos

Development and Ops need to take a unified approach to managing change. The scripts and tools that manage infrastructure should be subject to the same process and rigor as application code, and the two should be managed as one system instead of as independent parts. At a glance, you should be able to tell when a version of a runtime was updated, or an application code update was made.

Platform.sh takes silo-smashing to a new level. Everything---from the app code, to the routing between services, the version of a runtime (e.g. PHP, Python, Node), the continuous integration (CI) scripts, and the system topology---is managed in Git, making those changes transparent and auditable by all.

Trust the machines

When humans are involved, drift happens. Changes that are made to a production configuration don't always make it back to staging and development environments. Humans take shortcuts. Humans forget things. Sometimes it's a cost concern, sometimes it's a technical one, sometimes it's just human error. But when differences emerge, systems break, and tests are no longer valid. Automation is the answer. Replicate all changes made to production environment configurations to your staging and dev, from infrastructure configuration to code.
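As a minimal sketch of what automated drift detection can look like, the Python below compares two environments described as simple name-to-version mappings and reports every component that differs. The manifest format, the `find_drift` function, and the sample component names and versions are all hypothetical, made up for illustration; this is not how Platform.sh or any particular tool implements it.

```python
from typing import Dict, Optional, Tuple

# Hypothetical manifest format: component name -> version string.
Manifest = Dict[str, str]


def find_drift(
    production: Manifest, other: Manifest
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every component whose version differs between two environments.

    Each value is a (production_version, other_version) pair; None means
    the component is missing from that environment.
    """
    drift = {}
    for name in sorted(set(production) | set(other)):
        prod_version = production.get(name)
        other_version = other.get(name)
        if prod_version != other_version:
            drift[name] = (prod_version, other_version)
    return drift


# Illustrative data only: these versions are assumptions, not real inventory.
production = {"php": "7.3", "php-ext-redis": "4.3.0", "left-pad": "1.3.0"}
staging = {"php": "7.3", "php-ext-redis": "4.2.0"}

for name, (prod_v, stage_v) in find_drift(production, staging).items():
    print(f"{name}: production={prod_v} staging={stage_v}")
```

Run on a schedule or as a deployment gate, a check like this turns "are staging and prod still in sync?" into a question a machine answers, rather than one a human has to remember to ask.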

Oh, and let's not forget about the impact of data. You'll want to sync your production data (scrubbed as needed) back to your other environments to get a true "like for like" test. All of this automation can be difficult and time-consuming, not to mention expensive. Therefore, it's uncommon. Further, due to organizational, process, and tooling silos, "dev," "stage," and "prod" are only loosely linked in many organizations. So it's a challenge just to get access to the systems that need to match.

Platform.sh makes perfect clones, with zero effort. We've built Platform.sh as an end-to-end platform for developing and delivering applications. Because we're in the production runtime, and we've built around Git, we give our users the ability to instantly synchronize their production environments---including infrastructure configuration, CI scripts, and all the data---to their development and staging environments, with a single click or command, in seconds.

Get nonlinear

System stability is often negatively impacted by human activity. We humans break rules. We find workarounds. We do things ever-so-slightly differently each time, especially when we're tired or stressed or hungry.

When we're faced with a demand to ship, under pressure, we may overlook some testing. Or work around a congested staging environment and commit to prod.

To prevent this, dev and ops teams should again work towards greater automation and deploy tools that allow them to work in parallel, rather than in series, on environments that closely mimic production systems.

Platform.sh gives you infinite staging capacity. We've taken our Git-driven workflow to its logical conclusion. When you create a branch in Git, we create an environment. In moments. Including a byte-for-byte copy of stateful services like databases, queues, and search. This way, every feature, every developer, and every team can work in their own perfect clone of prod. Teams can test changes to code, topology, or dependencies. When they're sure it works, a Git merge deploys those changes to production, automatically, so developers can feel confident.

Getting predictable

The mantra of a DevOps culture should be "Automate everything." Trust the machines. Once deployments become sufficiently automated, they become non-events. You should be able to deploy your system at any time---even on a Friday afternoon.

Getting there requires both organizational alignment around the goal of predictability with speed and the right toolkit. That combination enables teams to deliver what users care about: reliable, featureful apps that run well, all the time.