Getting GitOps right

10 November, 2020

Larry Garfield

Director of Developer Experience

The Git version control system is 15 years old and has by all accounts taken over the development world. It's unusual and weird to find a serious software project that isn't using Git today (though a few holdouts still have lots of auxiliary tools built around Subversion they haven't managed to jettison yet). The great thing about Git, though, is that it's capable of so much more than just a place to dump source code.

In recent years, many organizations (including Platform.sh) have adopted Git as a central part of their application deployment process, not just their code storage. Collectively this change has become known as "GitOps" -- because adding an "Ops" suffix to something makes it cool; Internet rules -- but what it means in practice varies widely by the organization. Like any vaguely defined trend, not all implementations are created equal.

The term was coined in 2017 (as far as I can find) and at its core boils down to three defining points:

Git as the single source of truth of a system
Git as the place where we control all environments
All changes are observable/verifiable

These are good aspirational GitOps principles, but what do they mean in practice?

For Platform.sh, it means "what we've been doing since 2014." Yep, Platform.sh is the hipster of GitOps.

Git’s extensibility for infrastructure as code

Git as the single source of truth for code is seemingly obvious. It's a version control system; that's what it does. The important details are the definition of "single" and "truth."

What makes Git useful for more than just code storage is that it's extensible. You can attach hooks to various events, pushes in particular, and trigger arbitrary actions on those events using the code in the repository as input.

That, in turn, allows you to put more than "just" code in Git. Or rather, more than just application code. For example, you can include definitions of what your infrastructure should be, declaratively. One of the triggers attached to a push operation is then "make my infrastructure look like that," which requires software that knows how to change your infrastructure accordingly. Those actions are usually run by a job server, which is a fancy way of saying "a server that runs scripts when stuff happens."
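To make "make my infrastructure look like that" a little more concrete, here is a minimal sketch of the comparison such software has to do: take what is declared in the repository, take what is running, and work out the difference. The `Service` type and `plan_changes` function are hypothetical names invented for illustration, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One declared piece of infrastructure (hypothetical shape)."""
    name: str
    type: str
    version: str


def plan_changes(current: dict[str, Service], desired: dict[str, Service]) -> list[str]:
    """Return the actions needed to make `current` match `desired`."""
    actions = []
    for name, svc in sorted(desired.items()):
        if name not in current:
            actions.append(f"create {name} ({svc.type} {svc.version})")
        elif current[name] != svc:
            actions.append(f"update {name} to {svc.type} {svc.version}")
    # Anything running that is no longer declared gets removed.
    for name in sorted(current.keys() - desired.keys()):
        actions.append(f"destroy {name}")
    return actions


if __name__ == "__main__":
    running = {"db": Service("db", "postgresql", "12")}
    declared = {"db": Service("db", "postgresql", "13")}
    for action in plan_changes(running, declared):
        print(action)
```

A real implementation would then have to carry out each action against actual containers, which is where most of the moving parts live.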

That's the origin of "infrastructure as code," a pattern that developed somewhat before GitOps but is a key part of it. The Git repository then becomes the source of truth for both application code and the infrastructure needed to run that code.

There is an important caveat here, though. We mentioned "software that knows how to change your infrastructure accordingly." By infrastructure, we're referring to what language runtime is installed, what other packages are installed, what other services are in use (databases, cache servers, search indexes, etc.), which of them are allowed to talk to each other, and so on. That, in turn, requires software-managed infrastructure, or what is more commonly called "cloud computing." (Not to be confused with "The Cloud," which is market-speak for "someone else's hard drive.") And cloud computing, in turn, is almost always implemented using containers. And all of that is an awful lot of moving parts under the hood.

How GitOps transforms workflows for infrastructure management

The combination of a Git repository and container tooling allows for the Git repository to serve not only as "the place information is stored," but as "the place changes are made." Code changes are obvious, but the repository also becomes the place that infrastructure is changed. If PostgreSQL needs to be updated, that becomes a change-in-Git just the same as updating a code dependency. If a second application runtime needs to be added, say for a new background Node.js process, that becomes a change-in-Git just the same as writing a new custom plugin for your application.

Those changes include the creation or destruction of new environments. In the most common scenario (and the one used by Platform.sh), every Git branch is, or can be, a new environment. Because the infrastructure is all software-driven containers, creating a completely new instance of Nginx, PHP-FPM, Elasticsearch, MariaDB, or whatever else for each branch becomes "just" a coding problem. Thus, "make a new test environment with its own collection of servers" turns into "push a new Git branch, and maybe flip a switch."
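As a rough sketch of the branch-equals-environment idea, the snippet below compares the branches in a repository with the environments that currently exist and decides which to create or destroy. The naming scheme in `environment_slug` and the `plan_environments` helper are assumptions made up for this example; real platforms have their own rules.

```python
import re


def environment_slug(branch: str) -> str:
    """Turn a branch name into a DNS-safe label (hypothetical naming scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # DNS labels are limited to 63 characters.
    return slug[:63].strip("-")


def plan_environments(branches: list[str], environments: set[str]) -> dict[str, list[str]]:
    """Decide which environments to create or destroy so each branch has one."""
    wanted = {environment_slug(branch) for branch in branches}
    return {
        "create": sorted(wanted - environments),
        "destroy": sorted(environments - wanted),
    }


if __name__ == "__main__":
    plan = plan_environments(["main", "feature/New_Checkout"], {"main", "old-experiment"})
    print(plan)
```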

The end result, and the desired result, is folding all of the various buttons and dials and switches to control the entire application, across multiple environments, into variations on "edit something in Git, commit, and push," which is a workflow developers are intimately familiar with.

Or, put another way, GitOps replaces DevOps by replacing the ops engineer with a small shell script in a Git hook.

Challenges beyond centralized source control in GitOps

Of course, it's not quite as simple as that. Life never is. While Git as the central source of truth has plenty of advantages, it doesn't solve all issues. Those need to be solved by additional tooling that is part of the infrastructure management layer rather than of the application-specific bits (which all belong in Git).

One area in which Git is quite useless is logging. Logs are important, in particular logs of when and how a given task succeeded or failed, but you would never put logs in Git. (Someone out there is about to say "challenge accepted." Trust me, it's a bad idea.) When something goes wrong (it always does), you want to have a record of not just which commits happened when but which commits triggered what script, which deployed which code, which code caused which infrastructure changes, and which infrastructure changes caused what failure. That requires extra tooling.
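The kind of record that paragraph asks for can be sketched as an append-only activity log kept outside Git, where each entry ties an action back to the commit that triggered it. Everything here (`Activity`, `ActivityLog`, the field names) is a hypothetical illustration, not a description of any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Activity:
    """One recorded event: who did what, for which commit, and how it ended."""
    actor: str
    action: str
    commit: str
    result: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ActivityLog:
    """In-memory, append-only log; a real one would live in durable storage."""

    def __init__(self) -> None:
        self._entries: list[Activity] = []

    def record(self, actor: str, action: str, commit: str, result: str) -> Activity:
        entry = Activity(actor, action, commit, result)
        self._entries.append(entry)
        return entry

    def for_commit(self, commit: str) -> list[Activity]:
        """Everything that happened because of one commit, in order."""
        return [entry for entry in self._entries if entry.commit == commit]
```

With something like this in place, "what did commit abc123 set in motion?" becomes a lookup rather than an archaeology project.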

Another aspect is secrets management. API keys, passwords, and other secret or semi-secret data need to be kept separate from your code. Committing keys to a repository is a sadly common security mistake, and you want to be able to avoid that. That requires having some other mechanism to track secrets, potentially encrypted, and inject them into an environment when it's created.

You will also need to inject environment-specific information. API keys are one such piece of information, as you generally don't want to use your production payment gateway key in a throw-away test environment where you're running integration tests. (Don't ask how I know that.) But it also includes database credentials (since each environment is a separate server), the domain name an environment is running on (which an application often needs for security reasons), or even performance-sensitive information like the number of CPU cores available (for applications that are multi-threaded or spawn multiple worker processes). None of that can or should be in Git; it should be managed separately and injected into each environment.
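From the application's side, "injected into each environment" usually means reading values from environment variables at startup instead of from a committed file. The sketch below assumes variable names (`DATABASE_URL`, `APP_DOMAIN`, `PAYMENT_API_KEY`, `WORKER_COUNT`) that are invented for this example; use whatever your platform actually provides.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    domain: str
    payment_api_key: str
    workers: int


def load_settings(env=None) -> Settings:
    """Build settings from injected variables, failing loudly if any are missing."""
    env = os.environ if env is None else env
    required = ("DATABASE_URL", "APP_DOMAIN", "PAYMENT_API_KEY")  # hypothetical names
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        domain=env["APP_DOMAIN"],
        payment_api_key=env["PAYMENT_API_KEY"],
        # Fall back to the detected CPU count when no worker count is injected.
        workers=int(env.get("WORKER_COUNT", os.cpu_count() or 1)),
    )
```

Because nothing environment-specific is hard-coded, the same commit can run in production and in a throw-away test environment with different keys.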

Another tricky question is validation. Infrastructure-as-code is great when the "code" has no bugs. Which never happens, because code. So at what stage do you validate the infrastructure definition? Before it's committed? After? When you try to update the environment and it fails? There are a number of possible ways to solve that, all with trade-offs.

While GitOps, as a concept, calls for all changes to be observable and verifiable, Git itself only helps partially. The rest falls to the tooling around it.

GitOps best practices to avoid implementation pitfalls

In the words of Albert Einstein, “The definition of genius is taking the complex and making it simple.” Genius is quite hard, of course. Making GitOps-type workflows work is a lot of work, and presenting a reliable and straightforward interface to it is even harder. The ideal vision is "you push code to Git, black magic happens, and your application is safely updated--tada!" But that "black magic" is doing a lot of heavy lifting.

There are a lot of moving parts to a successful GitOps configuration, more than most people realize, and some implementations leave out many parts. Often these are cobbled together using Kubernetes, kubectl, Helm, OpenShift, Docker Swarm, Mesos, or Rancher hooked up to Jenkins, Travis CI, or CircleCI, plus an ELK stack, a couple of Nginx instances, and various other tools that sort of work together most of the time, if you configure them just right.

It's also easy to get wrong. For example, some GitOps-style implementations set up a separate repository for each environment rather than just a branch. That creates entirely pointless complication and makes everything harder. Please don't do that.

Most tools used in setting up a GitOps-style workflow are also extremely flexible. While flexibility is good, it also runs into Parker's Law. ("With great power comes great responsibility.") Often, they have so many moving parts and configuration options that figuring out how to configure them just right is a full-time job unto itself. That's why there are even tools (Helm) to manage tools (Kubernetes) to manage tools (Docker) to manage tools (Linux namespaces). And it's YAML all the way down.

If GitOps were easy, it would be called Platform.sh

Platform.sh is, in a sense, a "not having to build and maintain a complicated GitOps pipeline"-as-a-Service. We've already built it and made smart decisions about how certain pieces should work (like, don't have a separate repository for each environment; we don't support that, don't do it), so your number of decisions is reduced to a more manageable and appropriate level.

[Comic: "Automation". Credit: XKCD]

Since we knew we were building that level of system from the start and didn't need to make absolutely everything an adjustable dial, we were able to simplify the conceptual system greatly. There are many more moving parts behind Platform.sh than most customers ever see, but there are still far fewer moving parts than an equivalent Kubernetes/Jenkins/ELK/AWS home-grown solution.

In Platform.sh's case, for instance, "all changes are verifiable" is one of the reasons we use a read-only file system for applications. While that is problematic for some poorly written legacy applications that try to mix their environment configuration and application configuration into a single file (you know who you are), it guarantees that there are no "fix it in production" hot fixes that get lost as soon as a new build is triggered.

Secrets management is handled through environment variables, which can be marked sensitive or not and vary by environment or not.

Environment-sensitive information, like the domain or database credentials, is also injected via environment variables. That makes it a little more secure, too, in part because developers never even need to know what the credentials are. Passwords are auto-generated, and the user never needs to interact with them at all.
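One way to picture variables that "can be marked sensitive or not and vary by environment or not" is as two layers, project-wide defaults and per-environment overrides, with a flag that hides values when they are listed. This is a sketch of the idea only; the `Variable` type and both helper functions are hypothetical and say nothing about how Platform.sh stores variables internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    value: str
    sensitive: bool = False


def resolve(project: dict[str, Variable], overrides: dict[str, Variable]) -> dict[str, Variable]:
    """Environment-level values win over project-level defaults."""
    return {**project, **overrides}


def for_display(variables: dict[str, Variable]) -> dict[str, str]:
    """Mask sensitive values when listing variables, e.g. in a UI or a log."""
    return {
        name: "******" if var.sensitive else var.value
        for name, var in variables.items()
    }


if __name__ == "__main__":
    project = {"PAYMENT_KEY": Variable("live-key", sensitive=True), "LOG_LEVEL": Variable("warning")}
    staging = {"PAYMENT_KEY": Variable("sandbox-key", sensitive=True)}
    print(for_display(resolve(project, staging)))
```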

Config file validation, activity recording, and simplified inter-container configuration

We made the decision to validate configuration files on Git push. That's not the only possible place to do so, but it's what makes sense in our system. That means if a Git push is accepted, the configuration files at least validate syntactically. They may still have bad configuration in them you don't want, but that's just a bug like any other. (That's what ephemeral staging environments are for!)
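Validation at push time can be sketched as a function that inspects the configuration and returns a list of problems; the push is accepted only when that list is empty. The example below assumes the file has already been parsed into a dictionary, and the required keys (`name`, `type`) are assumptions chosen for illustration. Only the hook names come from this article.

```python
KNOWN_HOOKS = ("build", "deploy", "post_deploy")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config is acceptable."""
    errors = []
    for key in ("name", "type"):  # hypothetical required keys
        value = config.get(key)
        if not isinstance(value, str) or not value:
            errors.append(f"'{key}' must be a non-empty string")
    hooks = config.get("hooks", {})
    if not isinstance(hooks, dict):
        errors.append("'hooks' must be a mapping")
    else:
        for hook in hooks:
            if hook not in KNOWN_HOOKS:
                errors.append(f"unknown hook '{hook}'")
    return errors


def accept_push(config: dict) -> bool:
    """Reject the push outright if the configuration does not validate."""
    return not validate_config(config)
```

Note what this does not catch: a perfectly valid file that asks for the wrong thing. That is still a bug, just one that a staging environment has to find.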

Platform.sh doesn't have a dedicated, user-accessible job server. Instead, we provide direct hooks for the three most common "jobs":

Building the application (the build hook)
Exclusive setup after the application is deployed (the deploy hook)
Non-exclusive setup after the application is deployed (the post_deploy hook)

Each is an arbitrary shell script that can be of whatever complexity necessary for your application, without having to think about Jenkins or Travis CI or other extra configuration.
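To illustrate the "arbitrary shell script per hook" idea in general terms, here is a minimal sketch that runs whichever hooks are defined, in a fixed order, and stops at the first failure. It is an assumption-laden toy, not how Platform.sh runs hooks: the real system runs each hook at a different stage of the build and deploy process, which this ignores.

```python
import subprocess

# Order taken from the hook names above; anything else here is illustrative.
HOOK_ORDER = ("build", "deploy", "post_deploy")


def run_hooks(hooks: dict[str, str]) -> list[str]:
    """Run each defined hook script in order and return the names that completed.

    Raises subprocess.CalledProcessError as soon as a script fails, so later
    hooks never run against a broken build.
    """
    completed = []
    for name in HOOK_ORDER:
        script = hooks.get(name)
        if not script:
            continue
        # `sh -e` aborts the script on the first failing command.
        subprocess.run(["sh", "-e", "-c", script], check=True)
        completed.append(name)
    return completed
```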

All actions are recorded as Activities in our system, which can be exposed either programmatically or through the web UI. You can always go back and find out not just who edited what code and when, but who triggered what action and when.

Complex inter-container configuration has been reduced to the bare minimum of YAML, which translates to appropriate service configuration under the hood. For instance, you don't need to manually keep track of what containers have what IPs and what open ports. Instead, the .platform.app.yaml file identifies services you want to connect to (the relationships section), and the necessary network configuration "just happens," resulting in a predictable internal domain name and environment variable you can rely on.
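On the application side, consuming a relationship can be as small as decoding one environment variable. The sketch below assumes the relationships arrive as base64-encoded JSON in a variable named `PLATFORM_RELATIONSHIPS`, with each relationship name mapping to a list of endpoints; the endpoint field names used in `database_url` are assumptions for illustration, so check the current documentation before relying on them.

```python
import base64
import json
import os


def load_relationships(env=None) -> dict:
    """Decode the relationships variable, or return {} when it is not set."""
    env = os.environ if env is None else env
    raw = env.get("PLATFORM_RELATIONSHIPS", "")
    if not raw:
        return {}
    return json.loads(base64.b64decode(raw))


def database_url(relationships: dict, name: str = "database") -> str:
    """Build a connection URL from the first endpoint of one relationship."""
    endpoint = relationships[name][0]
    # Field names below are assumed for this example.
    return "{scheme}://{username}:{password}@{host}:{port}/{path}".format(**endpoint)
```

The application never hard-codes a host, port, or password; it only knows the relationship name it asked for.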

Simplify source code management with Platform.sh’s GitOps

GitOps, as a concept, is wonderful. It allows you to use a tool developers already know well, Git, and use that as the control panel for most of your deployment life while retaining a single source-of-truth for both your code and your logical infrastructure. And the larger your team, the larger the benefit.

GitOps as an implementation, though, has far more moving parts than you'd expect, even with available off-the-shelf tools. Like any complex system, making it seem simple is hard and easy to get wrong. For most developers who make their money from the code they write and the service they offer, maintaining a GitOps infrastructure is an incidental complexity they simply don't need. Most would be better served by a GitOps-in-a-box service. And that's what Platform.sh offers.