A dreaded interview question
As an interview question I ask: “What are some of the advantages and some of the disadvantages of static binaries compared to dynamically linked binaries?” The correct answer is “static binaries are easier to deploy but harder to update.” You get bonus points for saying “there will be differences in startup times” (if libraries are already cached by the OS, dynamic linking might yield better performance; on the contrary, if on a cold cache the OS has to seek all over the disk to find many small files, going dynamic may take longer). “Memory footprint will be higher for static binaries” is again a correct answer. And of course “disk space utilization will be higher for static binaries.”
Now, you may say that except for the first point - “harder to update” - all of these other considerations have little to no consequence; disk space is so cheap, and memory is getting cheaper. So… who cares? Deployment is a pain well worth a bit of overhead. And in the use cases of yore you might have been right: Linux never really suffered from DLL Hell, and still the upside sounds compelling for static binaries. If you have a single instance of your program running on a server, the overhead will be negligible in all but the most extreme use cases. And you think, well, updating may be just a teeny-weeny bit harder, but with all my DevOps goodness I should be fine.
Enter containers.
We like containers. Platform.sh runs as a high-density grid of micro-containers. I posit that containers compound this issue; and that this is where, these days, the sorrow starts welling up (it will run deep and wide, bear with me).
We know why containers are great, right? And it’s not only the performance gains (as measured against other forms of virtualization). They give us a new abstraction. And a damn good one. Applications are more than code. More than the executable that we run. An application is one with its underlying infrastructure. It is nothing without its configuration, its data, its place in a topology. Just a bunch of useless bits. And the container is just that: “the minimal set of stuff needed to run the application, bundled together in one place, with no external dependencies”. This is precisely the abstraction that Docker added on top of LXC: not being “a lighter VM” but insisting on the portability of a single application. Or, more precisely, of a single service that can be part of an application.
So basically containers are the new static binaries. They take the idea further. They “statically bind” not only the code but everything else. Basically everything we said about static binaries will be true for “containers as applications”: they are easier to deploy, and they are harder to upgrade. And, by the way, the overhead they represent might now be compounded if they are running static binaries. If you have thousands of containers running on a server (and we do), the “inconsequential” overhead has just been multiplied several times over. The more we think about those containers as blobs, the more they resemble good old virtualization. They become new “VM appliances”.
You might argue this is untrue because of the micro-services approach: every container is a single service, more horizontal in nature, not an obese “appliance”. But here we see that it is hard to delineate precisely the bounds of what makes an “application”. These days an application will usually be an ensemble, a graph composed of multiple services; as such our single container, though already useful, is not yet the whole thing. If the application is, for example, a web application, it may rely not only on an internal API but also on caching mechanisms, a search engine, and maybe a second app to handle CMS content; and there is a very good chance that without a specific Nginx config, it won’t run. Without all of these in a coherent state we are back to useless bits. With containers, deploying each part of this service graph is easier, as each service will probably simply expose a single TCP socket to the others. The opacity here is extremely useful as an abstraction. But without careful architecture you get hit by the same penalties as static binaries: higher overhead, higher cost of update.
When looking at the orchestration layer, the opacity of each component, of each container - having everything bound together neatly in one place, without the need to know any of the internals - clearly makes our life easier. This is a kind of plug-and-play architecture. A Lego of components. But at the price of higher overhead and probably long-term instability. It might be plug and play, well, like Windows 95 was, because the internal opacity can also bring us flakiness. Where in a traditional architecture we would have had maybe two or three servers for a simple application, we might now have the equivalent of dozens for the same functional perimeter. Each component is separated, but the way they are glued together can be unstable. Do you really want to try switching your datacenter off and on again in order to resolve an issue? Plug and play is nice, but when talking about deployments, consistency is more important. Please remember that this is not something imposed by the container orientation in itself; LXC and other containerization techniques are just isolating a binary from the rest of the OS. With enough care we can pay very little overhead, or none at all. If we work on the correct abstractions we can avoid the flakiness. More on that later.
Now, let’s get back to the first counter-argument to static binaries: the updating thing.
Updating things in a blob world
Imagine that we have a security issue in a library. Let’s say in a popular, widely used one (that never happens, right?). We might very well have thousands and thousands of copies of the vulnerable routines hidden away in binaries, themselves hidden inside the blobs that are containers. There are very few tools to help us discover which those might be, and which binary used which library at which version. A problem like this is not something that might happen someday; it is something that is bound to happen. Imagine that many of the services you are running have been pulled from a public repository of “container images”. How will you know which ones are affected? And the more “micro-service” you went, the more of these images you are going to have. The complexity is staggering. Here at Platform.sh we can run more than fifteen hundred containers per host, and we have hundreds of hosts across which we run containers. As a rule of thumb, the smaller the granularity of the services we run, the happier we are, which is also why minimizing overhead is of extreme importance to us. Without a consistent, traceable build process, without an orchestration layer, reacting to a critical flaw in a widely used library would amount to “a riddle, wrapped in a mystery, inside an enigma,” in the words of Winston Churchill.
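To make the auditing problem concrete, here is a minimal sketch in Go (using only the standard library’s debug/elf package) of the kind of scan you would have to run over every unpacked image; the rootfs path and the library name are made up for illustration. Note that it only catches dynamically linked copies - a statically linked copy of the same vulnerable code is completely invisible to this kind of scan, which is precisely the problem.

```go
// Walk an extracted container rootfs and report which ELF binaries
// dynamically link a given library (e.g. a vulnerable libssl).
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	rootfs := "/var/lib/images/some-image/rootfs" // hypothetical path to an unpacked image
	needle := "libssl"                            // the library we are hunting for

	filepath.Walk(rootfs, func(path string, info os.FileInfo, err error) error {
		if err != nil || !info.Mode().IsRegular() {
			return nil // skip unreadable entries, directories and symlinks
		}
		f, err := elf.Open(path) // fails quietly on anything that is not an ELF binary
		if err != nil {
			return nil
		}
		defer f.Close()
		libs, err := f.ImportedLibraries() // DT_NEEDED entries; empty for static binaries
		if err != nil {
			return nil
		}
		for _, lib := range libs {
			if strings.Contains(lib, needle) {
				fmt.Printf("%s links against %s\n", path, lib)
			}
		}
		return nil
	})
}
```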
Now, say what you may about Debian; compare this nightmare scenario, updating those opaque blobs inside opaque blobs coming from a third party… to updating a single shared library and gracefully restarting a bunch of servers. That day, The Debian Way wins.
There are of course many ways to lessen the blow. The first that comes to mind is imposing fully traceable, repeatable builds. This means that the good practice is to build your container images from source. It is much easier to find vulnerable containers by scanning their underlying code, and if everything is automated from build to deployment, you pay no cost for deploying the exact same version again (the built container being simply a cache of the build process). The cost to redeploy an updated version is also marginal (patch the source, push to git, a single image is built, and voilà). This is how we do it at Platform.sh.
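For illustration, here is a rough sketch of that idea - the built image as nothing more than a content-addressed cache entry keyed on the source. The paths and the missing buildImage step are hypothetical; this is the shape of the approach, not our actual build pipeline.

```go
// Key every built image on a hash of its source tree: same source, same key,
// nothing to rebuild; a one-line patch changes the key and triggers one rebuild.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
)

// hashTree hashes file paths and contents in a stable order.
func hashTree(root string) (string, error) {
	var files []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err == nil && info.Mode().IsRegular() {
			files = append(files, path)
		}
		return err
	})
	if err != nil {
		return "", err
	}
	sort.Strings(files)
	h := sha256.New()
	for _, path := range files {
		f, err := os.Open(path)
		if err != nil {
			return "", err
		}
		io.WriteString(h, path)
		io.Copy(h, f)
		f.Close()
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	key, err := hashTree("./app") // hypothetical source checkout
	if err != nil {
		panic(err)
	}
	image := filepath.Join("/var/cache/images", key) // content-addressed image store
	if _, err := os.Stat(image); err == nil {
		fmt.Println("cache hit: redeploying existing image", image)
		return
	}
	fmt.Println("cache miss: building image for source", key)
	// buildImage(image) would go here (placeholder)
}
```

The important property is that redeploying unchanged source is a pure cache hit, while a patch produces exactly one new, traceable image.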
Containers at Platform.sh are truly still blobs; we build each one once. They are still a content-addressable thing. If you deploy 100 MySQL servers in a grid, on each host they land on we only pay the disk-space penalty, and much of the memory penalty, once. Everything shared will only cost the OS once. This is because all our containers are read-only. We mount anything that might be mutable on a different filesystem, so the r/w parts are always explicit. When we need to upgrade a piece of software, that just means pushing the commit, and it gets built and deployed. And because the relationships in a topology are semantic, because we understand who depends on whom, it is now just a matter of refreshing clusters, which happens without downtime.
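One common way to get that “read-only image, explicit writable parts” layout on Linux is a plain overlayfs mount: the shared image is the lower layer and is never written to, and each instance gets its own upper directory for anything mutable. The sketch below is illustrative (hypothetical paths, requires root, Linux only) and is not a description of our internals.

```go
// Mount a shared read-only image with an explicit per-instance writable layer.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	image := "/var/lib/images/mysql-5.7"    // shared, read-only, content-addressed image
	upper := "/var/lib/instances/db1/rw"    // explicit mutable state for this instance
	work := "/var/lib/instances/db1/work"   // overlayfs scratch space
	target := "/var/lib/instances/db1/root" // what the container actually sees

	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", image, upper, work)
	if err := syscall.Mount("overlay", target, "overlay", 0, opts); err != nil {
		panic(err)
	}
	fmt.Println("instance root mounted at", target)
}
```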
Abstractions
Containers and VMs are not only run-time concepts. It’s not only about their efficiency; they are also abstractions, and most of what we do in software is about expressing the correct level of abstraction. Abstractions serve to hide complexity and to draw system boundaries. They allow us to separate responsibilities; they allow us to manage change. But blobs by themselves are not abstractions, or at least not useful ones. They are the ABI, the “Application Binary Interface”, not an API; they are implementation details. For precisely this reason, having a “build process” for your apps and then for your containers is not enough, not by a long shot. You want this build process to have other qualities - primarily, the further you go towards the outer layers of your build, the more declarative and less imperatively scripty you should get. At Platform.sh we not only build the application, we also build the infrastructure. And the build is based simply on declaring the dependencies of the application.
Consistent Change
We want declarative infrastructures because we are pursuing what can seem like an oxymoron - consistent, source-based deployments (where the source moves). You want to be able to update everything, update it dynamically, update it safely, and update it all the time, but you don’t want the application to break all the time. So at Platform.sh the git commit that represents an application holds two things at the same time: an immutable reference to the source (the application, but also any dependencies, such as the precise version of your database, or even the memory allocated to a service), and an immutable reference to the built containers, which is a hash of the binary blob (the built file system). You get the benefits of static binaries. The whole shebang, together, will always run, together, as an integral unit. It no longer has any external dependencies. But the point is that what we manage together is not only the consistent state of each micro-container, but also the graph of services: their relations, the order in which they need to run, the order in which they need to stop; how to freeze them and how to gracefully restart them; how to clone them and how to move them between hosts. These are the orchestration-level abstractions that we believe to be an integral part of the application, and that allow us to confidently and continuously deploy.
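A toy version of the “ordered graph” point, just to make it concrete: if each service declares who it depends on, the start order is a topological sort of that graph and the stop order is its reverse. The service names here are invented; in reality the relationships come from the declared topology.

```go
// Compute the order in which services must be started from "who depends on whom".
package main

import (
	"fmt"
	"sort"
)

// dependsOn maps each service to the services that must be running first.
var dependsOn = map[string][]string{
	"mysql":         {},
	"elasticsearch": {},
	"app":           {"mysql", "elasticsearch"},
	"nginx":         {"app"},
}

// startOrder returns the services in dependency order via a depth-first walk.
func startOrder(graph map[string][]string) []string {
	var order []string
	visited := map[string]bool{}
	var visit func(string)
	visit = func(s string) {
		if visited[s] {
			return
		}
		visited[s] = true
		for _, dep := range graph[s] {
			visit(dep)
		}
		order = append(order, s) // a service comes after everything it depends on
	}
	names := make([]string, 0, len(graph))
	for s := range graph {
		names = append(names, s)
	}
	sort.Strings(names) // stable output regardless of map iteration order
	for _, s := range names {
		visit(s)
	}
	return order
}

func main() {
	fmt.Println("start order:", startOrder(dependsOn))
	// the stop order is simply the reverse
}
```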
With the extra layers of orchestration, by letting git be the repo of all things, you can now also update. Just push a new commit; we will rebuild everything that needs to be rebuilt, we know how to “diff” the current infrastructure against the one newly described, and we update the running environments so that the new world matches its description. And now the infrastructure is at a new commit. It is again immutable. If you revert, you get back the precise state you were in before. And you can always diff. You can always know precisely what has changed between two deployment states of the application.
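Conceptually, the “diff” is simple precisely because each commit pins every service to the hash of its built container: converging the running environment is just the difference between two maps. A toy sketch, with placeholder hashes:

```go
// Diff two deployed states, where each state maps service name -> built-image hash.
package main

import "fmt"

type state map[string]string // service name -> built-image hash

func diff(current, desired state) {
	for name, hash := range desired {
		old, exists := current[name]
		switch {
		case !exists:
			fmt.Println("create", name, "at", hash)
		case old != hash:
			fmt.Println("update", name, old, "->", hash)
		}
	}
	for name := range current {
		if _, keep := desired[name]; !keep {
			fmt.Println("remove", name)
		}
	}
}

func main() {
	current := state{"app": "sha256:aaa", "mysql": "sha256:bbb"}
	desired := state{"app": "sha256:ccc", "mysql": "sha256:bbb", "redis": "sha256:ddd"}
	diff(current, desired) // update app, create redis; mysql is left untouched
}
```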
Infrastructure rot
Building afresh, with declarative, immutable infrastructures, is what defends your application from infrastructure rot. Because, like code, infrastructures rot - and maybe faster. A tweaked configuration parameter here, a changed memory allocation there; then a cloned VM or container (now running on different specs…). Why is this parameter there now? No one will know. And no one will dare change it. Infrastructures rot like code does: when fear of change sets in, when there are no commit messages. Copy-and-pasted code has no semantics; it may work, but it rots. The same is true for every part of the whole infrastructure.
But Puppet and Friends?
But there are so many nice automation tools these days, you might say; and the code that runs the scripts is of course in a git repo… We love Puppet, and we love Fabric; we use both, and all of these (along with Ansible and Salt, and even Chef) are great, useful tools. But as long as you are shuffling blobs around and putting them into production, as long as the deployment is not managed with the application as a single entity, there is no real difference between doing this and using FTP on production (well, that’s a bit harsh maybe… still, you have no real consistency guarantees, no guarantee that deployment and application will not diverge). You might of course be one of the golden ones, already doing continuous deployment, having heavily invested both in testing your deployment and in huge amounts of homemade tooling to make the whole thing smooth. When you continuously deploy with these methods, you may not be able to avoid failure, but because you detect it at a smaller granularity, you don’t let the thing rot. It may continuously break, but you can continuously repair it. So failure here is in no way unavoidable; avoiding it is simply very costly.
Somewhat orthogonal to configuration management and deployers, we have lately seen the emergence of an orchestration layer for containers: from the older Mesos project to the more recent Swarm, Fleet, Flynn and Kubernetes (and probably a dozen others; there seems to be a new one every week). Most still lack maturity, and of these, the one that accounts for many of the concerns I am raising is probably Kubernetes. It exposes some of the primitives that would allow you, with sufficient investment, to create an orchestration layer that allows for consistent operations. But it is a very low-level beast. It exposes a lot of moving parts… which defeats some of the purpose. We wanted to “containerize” in the first place because we wanted a simple abstraction. There are so many moving parts that transposing a whole cluster now becomes difficult. If an app is just a small Symfony thing that depends only on MySQL and Elasticsearch, the whole configuration for it should be no more than a ten-line file. But being simple should still allow it to have all the orchestration capabilities of a more complicated app. It may be simple, but it may still be critical; it may still need to evolve fast.
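To give a feel for what such a small declarative description could look like - and this is a hypothetical sketch of the idea, not any project’s actual file format - the app names what it needs, and the orchestration layer owns how those needs are met:

```go
// A hypothetical ten-line-ish declarative description of a small app and its services.
package main

import "fmt"

type App struct {
	Name          string
	Type          string            // runtime, e.g. "php"
	Relationships map[string]string // local name -> the service it points at
}

type Service struct {
	Type string // e.g. "mysql", "elasticsearch"
	Disk int    // MB of persistent storage
}

func main() {
	app := App{
		Name: "frontend",
		Type: "php",
		Relationships: map[string]string{
			"database": "mysql",
			"search":   "elasticsearch",
		},
	}
	services := map[string]Service{
		"mysql":         {Type: "mysql", Disk: 2048},
		"elasticsearch": {Type: "elasticsearch", Disk: 1024},
	}
	fmt.Println(app, services) // build, deploy order, scaling: all derived from this
}
```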
This might be a transitory phase. As these projects mature and evolve, higher abstractions may appear to hide away the details. But for the moment, when you go the way of the container, when you have chosen the static route, there are few options for avoiding infrastructure rot that are not extremely costly. Updating the stuff stays hard.
The zeitgeist is going static
It is almost funny to see this circular trend between “dynamic” and “static” - be it types, compilation or tests… and now infrastructures and deployments. And over the last two years or so the pendulum has clearly been swinging back to static; Ruby was all the rage just a short decade ago.
How many projects have you seen lately whose main value proposition seems to be “The same thing, reimplemented in Go, as a single static binary”?
Now, don’t get me wrong: though at its start I had little consideration for the design goals of the language, I have grown much warmer to it; its “no-nonsense” approach, coupled with the very lively community around it, has produced many a great project.
And here at Platform.sh, though most of our code is in Python, we write all of our high-throughput concurrent servers in Go, and we are very happy with it; it’s more than “concurrency for idiots”: many small language features make quickly writing robust network servers easier than in most other languages (probably not easier than in Erlang or Elixir, but then again, deploying and managing those demands so much more know-how).
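As an illustration of what that feels like (a generic example, not one of our servers): a complete concurrent TCP echo server fits in a couple of dozen lines, with one goroutine per connection and no external dependencies.

```go
// A minimal concurrent TCP echo server.
package main

import (
	"bufio"
	"log"
	"net"
)

func handle(conn net.Conn) {
	defer conn.Close()
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		// echo each line back to the client
		if _, err := conn.Write(append(scanner.Bytes(), '\n')); err != nil {
			return
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn) // one lightweight goroutine per connection
	}
}
```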
So this is not a rant against Go. Nor one against static typing, static tests, or even static infrastructures. In the words of the GoLang FAQ: “…although Go has static types the language attempts to make types feel lighter weight than in typical OO languages.”
Static typing is great because you really want errors to pop up before you deploy the code and run it. Static tests are great because they are faster, and you can never really know which paths a running program will take. And statically linked binaries are great because they make deployments easier, which also means faster adoption when you are doing bleeding-edge, GitHub-style open source. Static, immutable infrastructures are great because, really, do you want these kinds of things to be where your bugs get expressed? And static containers can also be great. As long as you remember that “All this has happened before, and all of it will happen again.”
So… if you are going the container way, I really believe you should think about the following:
- Going descriptive rather than script-oriented in your orchestration. Name things; give them a role.
- Colocating everything that is a dependency; if you need it in order to run the app, it’s part of the app.
- Going micro-container. Watching every ounce of overhead. Watching every cache that gets busted. Deduplicating everything that can be deduplicated.
- Building everything. Building all the time.
- Knowing that an application is a graph. And it’s an ordered one. All dependencies must be managed; all operations on a cluster must be ordered.
- Versioning everything.
And you’ll be fine. By the way, we are.
If you find this compelling… check out how Platform.sh supplants the dev-staging-prod paradigm with one that lets you clone an entire production cluster in a matter of seconds, creating a live ad-hoc environment for every git branch.