There is one service design flaw that engineers repeat day after day, month after month, year after year. Scalability? Nope, though it’s popular. Security? Happens, but not that frequently. Serviceability? Getting warmer. Give up? Don’t care because you don’t work on services? You will, and when you do, you’ll care.
The pervasive service design flaw is incompatibility. Service neophytes might wonder how this could be. After all, a wonderful thing about services is that you own the machines and versions of software running on them. If there were ever a situation where you could finally escape legacy software and control your environment, it would be services. Right? Wrong. Dead wrong.
Production services are a mixed blessing.
If you’re unfamiliar with services, perhaps you’re familiar with Windows applications. In particular, you may remember what used to happen when you installed two applications, each with a different version of the same shared dynamic-link library (DLL). The applications would often get confused as to which version to load, causing one or both applications to fail.
This problem was compounded by poorly written installers (that is, nearly all installers), which would overwrite a new version of a DLL with an old version. Uninstalling applications was even worse since you needed to reference count shared DLLs to avoid removing them prematurely, which only the best installers did well (that is, almost none).
The net result was that installing a PC game or utility would break a completely unrelated productivity application, and vice versa. Many DLLs were shared by multiple applications, so there might be only one install sequence that could make all the applications work, or perhaps no sequence would work if one of the applications used a new version of one DLL and an old version of another. We had a name for this situation: “DLL Hell.”
These days Windows prevents installers from overwriting shared DLLs and runs different versions of DLLs at the same time, giving each application the version it wants. Thank goodness we’ll never see DLL Hell again—right?
I’m only giving a brief overview of DLL Hell. Read the Wikipedia article on DLL Hell if you’re interested or a masochist.
Most services run on multiple machines—some run on thousands of machines. When you deploy a new version of a website or service to production, the update isn’t instantaneous. There will always be a time when some clients will be using the new version and some will be using the old version, unless you shut down your service during deployment (yuck!). What’s more, your website or service probably depends on other services that also independently update in a mixed way. Is this problem sounding familiar?
Unlike Windows, no one controls the World Wide Web operating system. There’s no one protecting old services you depend on from being overwritten or rolled back. There’s no one insisting that services run multiple versions side-by-side.
“Wait, what if you use a virtual IP address in front of a load balancer? You can take machines out of rotation, update them, and then put them back in.” Clever, but what about the first set of machines you put back? Unless you shut down your service and let all the connections drain, there will be several minutes when both the new and old versions of the service are in use. The same problem happens when using Azure’s DNS flip. So, what can you do?
Luckily, most modern services use versioning in the service contract to ensure the right version of interfaces is invoked and/or run multiple versions of services side by side using exposure control. If you use exposure control as I described in “There’s no place like production,” it’s relatively easy to direct traffic to the right version of your service. Then you can deploy new versions of your service without worrying about breaking old clients. That is, if you forget about the data. Oh yeah, that’s right: the data.
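To make the versioned-contract idea concrete, here’s a minimal sketch in Python. The handler names, the contract_version field, and the profile payload are all invented for illustration; a real service would more likely carry the version in a URL path or header, and exposure control would sit in front of this.

```python
# Hypothetical handlers for two versions of the same service operation.
def get_profile_v1(user_id):
    return {"id": user_id, "name": "Ada Lovelace"}


def get_profile_v2(user_id):
    # v2 splits the name field, which would break v1 clients if they received it.
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}


HANDLERS = {1: get_profile_v1, 2: get_profile_v2}
DEFAULT_VERSION = 1  # assumption: clients that send no version predate versioning


def handle_request(request):
    """Route a request to the handler for the contract version it asked for."""
    version = request.get("contract_version", DEFAULT_VERSION)
    handler = HANDLERS.get(version)
    if handler is None:
        raise ValueError(f"unsupported contract version: {version}")
    return handler(request["user_id"])


old_client_reply = handle_request({"user_id": 7})
new_client_reply = handle_request({"user_id": 7, "contract_version": 2})
```

Both handlers stay deployed, so an old client keeps getting the shape it expects no matter which machine answers during the mixed period.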
Many services manipulate data—client data, application state, and shared settings. Naturally, this data is shared across all roles within a web service or application—it provides the context and persistence to allow clients to do meaningful work with a service. Purists will tell you that services shouldn’t carry application state and shared settings should never change. That’s nice, but not always practical. Even if it were, there’d still be client data.
Remember that production is a mixed blessing. There’s always a time when a mix of old and new versions of a service is running. That means new versions of a service will run against old data, and old versions will run against new data. Don’t see the problem yet?
Let’s say you update your data schema. Your old code doesn’t understand the new schema, so first you deploy your new code, which you’ve written to understand both the old and new schemas. (You’re so clever!) Next you update the data schema and let your service run, until you realize there’s a horrible bug in the new code. So you must roll back to the old code. Uh oh. The old code doesn’t work against the new schema. You can’t roll back the database because you’d lose data that changed since the update. Updating your data schema has resulted in either having a horrible bug in production or losing client data.
Let’s say you have a setting that disables an incomplete or malfunctioning service feature—such settings are typically shared and dynamic in order to kill features that unexpectedly misbehave. Your new code has a fix for the feature, so you’ve changed the setting to re-enable it. Your team has also added other settings for other new features. Naturally, you deploy the new settings first—otherwise the new code wouldn’t have its needed settings. (You’re so clever!) The old code grabs the new settings, re-enables the broken feature, and yowsa!
Many thanks to Alexey Kouzmitch, who broke down these scenarios for my team.
The issue of a setting change impacting old code is easy to fix. Either always add new settings instead of changing existing ones (basically, version your settings), always deploy settings with roles (don’t make them shared), or always enable features as a post-deployment step.
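Here’s a rough sketch of the first option, versioning your settings. A plain dict stands in for the shared settings store, and the setting names (fancy_search.enabled, fancy_search.v2.enabled) are hypothetical.

```python
# Shared, dynamic settings. The kill switch is off because the old code's
# feature is broken.
shared_settings = {"fancy_search.enabled": False}


def old_code_feature_enabled(settings):
    # Old deployment: only knows the original setting name.
    return settings.get("fancy_search.enabled", False)


def new_code_feature_enabled(settings):
    # New deployment: reads a new, versioned name and defaults to off.
    return settings.get("fancy_search.v2.enabled", False)


# Deploy the new settings first: add a setting, and never flip the one the
# old code is still reading.
shared_settings["fancy_search.v2.enabled"] = True

old_sees = old_code_feature_enabled(shared_settings)
new_sees = new_code_feature_enabled(shared_settings)
```

The old code never reads the new name, so the broken feature stays off on every machine still running it, while the fixed feature lights up wherever the new code lands.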
The issue of updating your service data schema is slightly more complex. Schema changes must be forward and backward compatible—they work with the new code and work with the old code should a rollback become necessary. How do you do this? Either don’t make breaking changes to schemas (don’t edit/delete columns or stored procedures, only add them) or spread the schema change over two deployments.
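As an illustration of an add-only schema change, here’s a small sketch using Python’s built-in sqlite3 module. The customers table and the email column are invented for the example, and it assumes the old code names its columns explicitly (no SELECT * and no positional INSERT).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_code_add_customer(conn, name):
    # Old code names its columns, so extra columns don't affect it.
    conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))


def old_code_list_names(conn):
    rows = conn.execute("SELECT name FROM customers ORDER BY id")
    return [row[0] for row in rows]


old_code_add_customer(conn, "Ada")

# Additive change: a new nullable column. Nothing is edited or deleted.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")

# After a rollback, the old code still runs against the new schema.
old_code_add_customer(conn, "Grace")
names_after_change = old_code_list_names(conn)
grace_email = conn.execute(
    "SELECT email FROM customers WHERE name = ?", ("Grace",)
).fetchone()[0]
```

Rows written by the old code simply leave the new column empty, so the new code has to tolerate missing values there.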
If you must make breaking schema changes, first deploy new code that can handle the old and new schemas. Let it run a while to ensure it’s stable and functional. Later, install the schema changes in the next deployment. If that new deployment is faulty, a rollback will take you to the previous deployment, which works with the new schema.
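The two-deployment approach might look something like the sketch below, again with sqlite3. The breaking change here is a hypothetical rename of customers.name to customers.full_name; the table, columns, and helper functions are assumptions made for illustration.

```python
import sqlite3


def name_column(conn):
    # Deployment 1 code: detect which schema is live instead of assuming one.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(customers)")}
    return "full_name" if "full_name" in columns else "name"


def list_names(conn):
    column = name_column(conn)  # always one of two fixed identifiers
    rows = conn.execute(f"SELECT {column} FROM customers ORDER BY id")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (name) VALUES ('Ada')")

# Deployment 1 is live against the old schema.
names_before = list_names(conn)

# Deployment 2 ships the breaking schema change (rebuild the table, then swap).
conn.executescript("""
    CREATE TABLE customers_new (id INTEGER PRIMARY KEY, full_name TEXT);
    INSERT INTO customers_new (id, full_name) SELECT id, name FROM customers;
    DROP TABLE customers;
    ALTER TABLE customers_new RENAME TO customers;
""")

# If deployment 2's code is faulty, rolling back lands on deployment 1's code,
# which already copes with the new schema.
names_after = list_names(conn)
```

Probing the schema on every call is wasteful; a real service would probably cache the answer or key off a schema version number instead.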
You can handle service API changes in similar ways. Instead of versioning the service contract, you just add parameters to interfaces using backward-compatible default values. If you must edit or delete parameters, you do so in a new interface and keep the old interface around. If you must make breaking changes to existing interfaces, you do so using two deployments, just like the case of the breaking schema change.
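A small sketch of both API techniques, using a made-up search interface. The function names, the include_archived parameter, and the sample documents are all hypothetical.

```python
DOCUMENTS = [
    {"title": "deploy guide", "archived": False},
    {"title": "deploy checklist (2009)", "archived": True},
]


def search(query, include_archived=False):
    """Original interface. include_archived was added later; its default
    reproduces the old behavior, so existing callers don't change."""
    return [
        doc["title"]
        for doc in DOCUMENTS
        if query in doc["title"] and (include_archived or not doc["archived"])
    ]


def search_terms(terms, include_archived=False):
    """New interface for a breaking change (a list of terms instead of a
    single string). The old search() stays around for old clients."""
    return [
        doc["title"]
        for doc in DOCUMENTS
        if all(term in doc["title"] for term in terms)
        and (include_archived or not doc["archived"])
    ]


old_caller = search("deploy")
new_caller = search("deploy", include_archived=True)
newest_caller = search_terms(["deploy", "guide"])
```

Old callers that never heard of include_archived get exactly the results they got before, which is the whole point of the default.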
Even with the best of plans, there are times when a new service version can’t be compatible with the old version. In one recent disaster, a third party made a breaking schema change to its service, and that change also contained a security fix. We had to take the change right away, so naturally the service was down as soon as the schema was updated.
In cases where downtime is unavoidable, let operations and all other stakeholders know at least a week beforehand. That way, you can brainstorm ways to minimize the impact and properly communicate that impact to clients in advance.
Production is a mixed blessing. You can control nearly every aspect of production, but we live in a rapidly changing world, and those changes mean old and new software are frequently intermingled. Thus, your service changes need to be forward and backward compatible.
The good news is that service changes don’t need to be backward or forward compatible for more than two or three revisions, and there are straightforward mechanisms for ensuring your service continues running smoothly during new deployments. However, if your development and test teams only focus on the new version, your next deployment is likely to create widespread panic.
Don’t just be smart—be smarter. Think ahead and behind two deployments. Test strange configurations as necessary. Patiently deploy data changes. And always put the customer first. After all, when the customer isn’t first, neither are we.