It’s been interesting, and frustrating, watching the drama around healthcare.gov play out. Here in the real Washington, we’ve got a state-based exchange and while the first day or two was pretty ugly, it’s humming along pretty well now. Clearly things aren’t so rosy for the feds.
Lots of hand-wringing has been applied to the “how this could happen” question. To anybody who has built complex transactional systems, it’s pretty obvious that we’re simply seeing the consequences of an unfortunate gap between unit testing in isolation and integration testing at scale.
Systems like healthcare.gov involve a ton of different subparts: some manage your user account, others check eligibility with various agencies, still more crunch all your data to help search and filter and make plan recommendations, and so on.
All of these pieces ultimately have to work together, but trying to build them in one big clump is basically impossible --- it’s just too much going on all at once. So good engineers break big problems down into little ones and build each part separately. They also unit test each part on its own to make sure it behaves the way it’s supposed to.
Once the pieces are all built and tested in isolation, you plug them together into a complete system. Theoretically, everything just snaps into place and you’re done. Of course, it never works out that way. Sometimes this is because the “rules” of how the pieces should snap together weren’t well thought through. Other times it’s because components share “dependencies” that nobody thought about.
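To make the “rules of how the pieces snap together” problem concrete, here’s a minimal sketch. The component names and logic are hypothetical (this is not healthcare.gov’s actual code), but the pattern is the classic one: two pieces that each pass their own unit tests, yet disagree about the shape of the data flowing between them.

```python
# Hypothetical illustration: two components that each pass unit tests
# in isolation, but disagree on the "rules" of their shared interface.

def eligibility_service(birth_year):
    # Team A returns eligibility as the string "yes" or "no".
    return "yes" if birth_year <= 1995 else "no"

def plan_recommender(is_eligible):
    # Team B expects a boolean and branches on truthiness.
    return ["bronze", "silver"] if is_eligible else []

# Each unit test passes in isolation:
assert eligibility_service(1960) == "yes"
assert plan_recommender(True) == ["bronze", "silver"]

# Integrated, the bug appears: the non-empty string "no" is truthy,
# so an ineligible applicant still gets plan recommendations.
assert plan_recommender(eligibility_service(2000)) == ["bronze", "silver"]
```

Both teams did their jobs; the defect lives entirely in the seam between them, which is exactly why it only shows up at integration time.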
Take this simple example. Team A builds a module that uses a certain amount of computer memory, say two gigabytes (2GB). Team B builds another module that also uses 2GB. Your computers have 3GB of memory total. Each module might work great on its own, but what happens when you put them together on the same computer? Together they break the system, because the combined 4GB load exceeds the 3GB available.
And of course, it’s not usually remotely this clear cut. Maybe both modules only use 1GB under normal circumstances, so even the integration test succeeds. But when lots of requests start coming in at once, Module A starts using up more space, up to 3GB. The result is the same --- your site stops working --- but it only happens when both modules are put together and you run tests simulating a very high load on the system. The more complicated the interactions become, the harder it is to find the bugs in a test environment.
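The load-dependent version of the memory example can be sketched in a few lines. The modules and the growth numbers here are made up for illustration; the point is that every check short of a combined high-load test passes:

```python
# Hypothetical sketch: two modules that each fit in memory alone,
# but together exhaust a 3GB machine only under heavy load.

CAPACITY_GB = 3.0

def module_a_memory(requests_per_sec):
    # Module A's footprint grows with load: 1GB baseline,
    # climbing toward 3GB as traffic spikes (illustrative numbers).
    return min(1.0 + requests_per_sec * 0.002, 3.0)

def module_b_memory(requests_per_sec):
    # Module B stays near 1GB regardless of load.
    return 1.0

def system_ok(requests_per_sec):
    total = module_a_memory(requests_per_sec) + module_b_memory(requests_per_sec)
    return total <= CAPACITY_GB

# Each module passes its own test, even at peak load...
assert module_a_memory(5000) <= CAPACITY_GB
assert module_b_memory(5000) <= CAPACITY_GB

# ...and the integrated system passes at normal traffic (1.2GB + 1GB)...
assert system_ok(100)

# ...but fails only when both run together under launch-day load (3GB + 1GB).
assert not system_ok(5000)
```

Only the last check --- both modules together, simulated at launch-day volume --- exposes the failure, which is why realistic load testing late in the schedule is where these projects go wrong.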
Things just get worse when the teams building each individual part get further and further away from each other. Remember when outsourced parts didn’t fit together on the Boeing 787? Same problem.
And wait --- we’re not done yet. Because you can’t even do real integration testing until everything is ready to put together, you’re always feeling squeezed for time when you get there, and maybe you rush a little, or a lot --- and maybe everybody just hopes it’ll be OK, because there’s a deadline with political consequences.
So that’s why the site didn’t work. Yes, it was avoidable and some folks probably should lose their jobs --- but it’s also not a shocker --- this happens all the time. And it’ll get fixed, too --- in a couple of months, the exchanges will be working fine and most people will have forgotten about the annoyance. Do you really think healthcare.gov is more complicated than systems run reliably by Medicare or the IRS or frankly the NSA? No. Freaking. Way.
Which leads me to my last point for this post --- the idea that these website problems have any bearing at all on the viability of Obamacare is so unbelievably stupid I can’t believe that anybody spends one second seriously even talking about it. It’s like claiming representative democracy is a failed experiment because sometimes there’s a long line to vote.
Sean, good analogy. Many in business, healthcare, and government think that anyone who can hack code can design enterprise systems. Not so. You need software engineers and computer scientists to architect these enterprise systems. Healthcare is going through a huge paradigm shift as it moves from desktop PCs to mainframe EHR and enterprise systems --- systems that are mostly mission-critical, where lives are at stake. If Facebook or Amazon goes down, money may be lost, but not lives. We must rethink the type of skills needed to design and support healthcare software systems.
The states are contracting with statewide organizations who subcontract to local organizations. This is too disjointed and waters down getting the word out and the work done. ACA clients will need ongoing help making decisions about providers and claims problems, which may be too much for third-level contractors to handle.

CMS should arrange for Obamacare application helpers to work in all 1,300 SSA offices. SSA has lost 10% of its staff in the last 3 years. There are now between 4 and 8 empty workstations in each SSA office --- 6,000 to 10,000 in total, worth up to $200 million, and they are unused due to staffing losses. If not used by Obamacare, the government is wasting about $1 billion over the next 5 years. This would greatly simplify national PSAs --- just tell citizens to visit their local SSA office. ACA navigators should use them to reach the public.

When Medicare first started, SSA offices had to be open at night and on the weekends to get everyone enrolled. We must be successful in the roll-out of customer services for the Affordable Care Act. A website and an 800 number are not enough. I would not buy a car or a house that way. Many citizens need face-to-face customer service. This plan can be applied to other federal agencies, and we could add a second shift of white-collar workers; see whitecollargreenspace.blogspot.com or contact firstname.lastname@example.org or Tim at 989-701-8813.
Here's the problem: This happens. All. The. Time. And no one has EVER learned anything from it.
Those that say there will be problems are ignored as "naysayers" or "pessimists," until it is too late to do anything.
The website problems don't have any bearing at all on the viability of the ACA -- but if they continue to be glossed over, they will. Eventually, the "delay" folks will win the day.
Everyone wants to be an optimist, but when that optimism causes a colossal, systemic failure, no one is willing to take the blame for such a stupid outlook in an industry where you should ALWAYS be pessimistic. If you don't plan for the worst, you will lose EVERYTHING.
"Our servers can handle the load." No, your servers are going to catch fire, and you need to plan for excess capacity. You must be able to spin up a dozen additional VMs ASAP.
"Our integration testing is on track." No, you're going to have a catastrophic failure during testing because each team was given slightly different specs, and you need to plan for overtime to handle the last-minute problems. And not the B-team twelve time zones away; you need people on-site and out of bed.
"Our hardware has been triple-checked." No, the roof is going to fall in and destroy three racks, and you need to have a data recovery plan in place.