I’ve been in the consulting business for more than a decade. I’ve designed some really cool applications. I’ve designed the infrastructure for some big applications. What amazes me the most is how infrequently people design systems.
Case in point: I’m working on an enterprise document management system for my current customer. I was brought into the project very late, with only a handful of iterations left and most of the application and infrastructure design completed. My job was to come up with the deployment plan to take the app from dev, into the test environment, and finally into production (I know, it’s not Agile, but it’s how the customer does things). Everything was well and good until I started looking at the disaster recovery plan. It was obvious that the application architects and the infrastructure architects were not in the same room when designing the DR plan. Interestingly, neither side did anything wrong. In fact, both the app and the infrastructure were designed with best practices in mind. The only problem was that they wouldn’t work together.
Part of the disaster recovery plan called for replicating the SQL databases from the production environment to the disaster recovery environment using SAN replication. Good idea. Keep the replication down at the hardware level, and keep the DR boxes cold until needed. The app doesn’t need five-nines uptime. It’s OK if it takes an hour or two to bring up the app in DR. The infrastructure architects assumed that, in case of a catastrophic failure of the production environment, the app would just have to change its connection string to point to the new SQL servers in DR. Sounds logical, but they didn’t tell the application architect.
The application architect wanted a central repository for all the configuration information. Since this is a distributed app with a load-balanced front end, putting the configuration information in a SQL DB made sense. Each of the front-end boxes just needs a connection string to the common configuration SQL DB. Once again, good idea.
Where the whole thing breaks down is that the app stores other SQL connection strings in the configuration DB. These connection strings may point to other DBs on the same server. Now think about this for a minute. SAN replication is replicating a DB that contains SQL connection strings. In a disaster situation, the SQL server that the connection strings in the table point to may not be online anymore. Can you see a problem with that?
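To make the failure concrete, here is a minimal sketch in Python. It is not the customer's code: sqlite3 stands in for SQL Server, `Connection.backup` stands in for SAN replication, and the server names (`prod-sql01`, `dr-sql01`), the `app_config` table, and the connection string format are all made up for illustration.

```python
import sqlite3

# Hypothetical server names, used only for this illustration.
PROD_SQL = "prod-sql01"
DR_SQL = "dr-sql01"


def build_config_db():
    """Create a stand-in for the central configuration DB in production."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE app_config (name TEXT PRIMARY KEY, conn_str TEXT)")
    # The app stores *other* connection strings as rows in the config DB.
    db.executemany(
        "INSERT INTO app_config VALUES (?, ?)",
        [
            ("documents", f"Server={PROD_SQL};Database=Documents"),
            ("audit", f"Server={PROD_SQL};Database=Audit"),
        ],
    )
    db.commit()
    return db


def replicate(source):
    """Stand-in for SAN replication: an exact copy, contents untouched."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)
    return copy


def servers_referenced(db):
    """Return the set of server names found in the stored connection strings."""
    servers = set()
    for (conn_str,) in db.execute("SELECT conn_str FROM app_config"):
        parts = dict(p.split("=", 1) for p in conn_str.split(";") if p)
        servers.add(parts["Server"])
    return servers


prod_config = build_config_db()
dr_config = replicate(prod_config)

# The front-end boxes can be repointed at the DR copy of the config DB,
# but every row inside that copy still names the production server.
stale = servers_referenced(dr_config)
print(stale)
```

Changing the one connection string on the front-end boxes gets the app to the replicated config DB, but in this sketch the replicated rows still reference only `prod-sql01`, which is exactly the server that is assumed to be gone.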
Tomorrow, I’ll talk about possible solutions to the above problem, and how we solved it for the customer.
Cheers,
m²