Software Engineering is Still a Risky Enterprise

 

The problems of producing and maintaining software with predictable success were documented almost at the dawn of the profession by Frederick Brooks, Jr., in his essay, “The Mythical Man-Month.”  The persistence of those problems, and the continuing lack of established solutions, was confirmed as recently as 2007 by Scott Rosenberg in his book, Dreaming in Code.  Both Brooks and Rosenberg emphasize that the science of computing runs aground in the practice of software engineering because software engineering is accomplished by groups of people, and the coordination of those people tends to be problematic. 

 

Perhaps one obstacle to solving that problem is the fact that computer programmers, of all people, are prone to thinking that problems can be solved with software.  Thus, they tend to think of the maintainability of software as an intrinsic attribute of the software itself.  That is, they tend to think that one piece of software is more maintainable than another principally, or even solely, because of differences in how those pieces of software are written. 

 

However, if the production and maintenance of software is actually a social endeavor, then it is better accomplished when the social process of software production and maintenance is designed in addition to the software entities themselves.  In their book, Do You Matter? How Great Design Will Make People Love Your Company, Robert Brunner and his co-authors document various successful efforts to design social processes incorporating engineered products.  They document the successful engineering, not just of a product, but of the entire process by which humans incorporate that product into their lives.  The achievements they document lie not in the development of a wonderful software user interface for the product, but in the development of a way of living and working with the product that extends well beyond its user interface, to the tasks of choosing the right model, ordering it, learning how to use it, getting it repaired, and so on. 

 

 

Other Professions Explicitly Design How Their Work Is Done

 

Indeed, in other enterprises, the design of the human processes incorporating the engineered product is commonplace.  Consider the production of artwork.  That enterprise provides a particularly compelling example because, whereas the production of software is commonly assumed to be closely related to the science of computing and to involve great precision, the production of artwork is thought to depend on innate talent and inspiration, and would therefore, presumably, be less predictable. 

 

In fact, professional artists successfully impose predictability onto their work by devoting effort to the design of their activity separately from the design of their artwork.  The term artists use for thinking about their activity is workflow.  There are standard workflows that are known to be successful, and professional artists routinely consider and debate the virtues of those workflows.  A workflow breaks an artist’s labor down into constituent activities, which makes it possible for the artist to estimate the duration of those activities for a given project and to calculate costs.  The relative scale of two projects can be computed by comparing the durations of the various activities for each. 
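
To make the comparison concrete, here is a minimal sketch of that kind of workflow arithmetic.  The activity names and hours are invented for illustration; they come from no real studio.

```python
# Hypothetical workflow activities and per-project hours; the figures are
# invented for illustration, not taken from any real studio.
WORKFLOW = ["thumbnail sketches", "line art", "color", "final rendering"]

project_a = {"thumbnail sketches": 2, "line art": 6, "color": 5, "final rendering": 4}
project_b = {"thumbnail sketches": 1, "line art": 3, "color": 2, "final rendering": 2}

def total_hours(project, workflow=WORKFLOW):
    """Sum the estimated duration of each constituent activity."""
    return sum(project[activity] for activity in workflow)

def scale_ratio(p, q, workflow=WORKFLOW):
    """Compare the scale of two projects by their total workflow durations."""
    return total_hours(p, workflow) / total_hours(q, workflow)
```

Broken down this way, project_a is 17 hours of work against project_b’s 8, so the first project is a little over twice the scale of the second.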

 

Another example of the design of human processes to ensure maintainability and predictability can be observed whenever one takes one’s automobile in for maintenance or repair.  One can readily observe how the automobile is processed through a workflow that begins with obtaining information from the owner, proceeds through diagnosis, estimation, and repair, on to notification of completion and receipt of payment.  Again, the organization of this process is decidedly not accidental or variable from day to day.  It has been designed and optimized, and organizations compete on the basis of the quality of that design.  Indeed, a thesis of Brunner and his colleagues in Do You Matter? is that commercial organizations increasingly compete on the basis of the quality of the design of their human processes, their actual products having become interchangeable in terms of features, quality, and price. 

 

 

Software Engineers Do Not Design How Their Work Is Done

 

In software engineering, a commonly accepted principle is that an iterative development process is to be preferred over a sequential one.  So today, the development of most non-trivial software proceeds by producing one working version after another, incorporating feedback on each version into the development of the next, and so on, throughout the software’s useful life.  By implication, the software will always be undergoing modification. 

 

During these iterations, the software and the modifications to it are generally designed in advance, or at least, everyone would agree that should happen.  Also, the process of gathering and documenting requirements is planned in advance, and so are the testing procedures that the software will undergo. 

 

What usually is not explicitly designed, though, is the process by which the software developer will interact with the software during the iterative process of modifying it.  Before the software is built or changed, the developer may offer blueprints for review, but those almost never describe how the developer will be able to reproduce a bug that someone may find in the software after it has been completed, or how the developer will identify the cause, or later confirm that the defect is fixed without inadvertently creating another one.  Rather, it is invariably assumed that all of that will go smoothly provided the developer is competent.  That assumption is both faulty and risky, however. 

 

The assumption is faulty precisely because, as Brooks emphasized, software is usually not produced by capable individuals toiling alone, but rather by a number of people working together.  One developer may well make changes to a component of a software system that will disrupt how another developer has been accustomed to debugging that component.  For example, this writer has spent the last several years as part of a large team developing a commercial product of which one component is a Windows service.  At the beginning of the project, it was possible to start the service under the Microsoft Visual Studio debugger.  Then someone made a change to the service after which it would never successfully start in that environment.  From then on, if one wished to attach a debugger to the service, that had to be done after the service was installed and running.  Consequently, debugging errors that occurred as the service was starting became very difficult.  This unfortunate side-effect of the change was not anticipated because the existing processes for developing the service were never documented or even discussed, and the preservation of those processes was never prioritized. 

 

Assuming that developers will readily figure out how to make changes to software simply because they are qualified to work as programmers is risky for several reasons.  One is that the personnel on a project change over time, so the person who initially developed some component is seldom the person who will later be required to maintain it.  Also, if only one or two developers know how to go about modifying some part of the system, then a release may be blocked while they resolve a number of defects and other developers sit idle.  Most importantly, if a software system is developed iteratively over its lifetime of use, then the primary concern should be to ensure that the process for accomplishing that iterative development is documented and effective for every one of the components.  Yet that is precisely the process that is simply left to chance!  At the logical extreme, a piece of software with many significant flaws, but which one knows how to quickly and reliably fix, is decidedly more valuable than one that has just one or two significant flaws, but which is practically impossible to repair.  Whereas the former might well be released after a period of time that one could accurately estimate, the latter probably never would be. 

 

 

Simple, Actionable Recommendations

 

Once the importance of designing how software is to be maintained is recognized, the actions to take become readily apparent, and are quite straightforward.  First, before a developer is to initiate the construction of a software entity, or any significant alteration to it, the developer must provide not only a satisfactory design of the entity itself, but also an explicit and acceptable account of the process by which the software is to be maintained.  Second, in evaluating alternative designs for the software, the implications of those alternatives for the processes of maintaining the software must be taken into account.  Third, the cost of establishing and executing the maintenance processes must be factored into the overall estimation of the project. 

 

 

Explicit Designs of Maintenance Processes

 

In the server this writer’s team has been developing, input is received via HTTP and electronic mail, and the implications of every input have to be calculated before a decision can be made about whether the input is permissible.  The algorithm for that calculation is complex, and is currently implemented using a number of Transact-SQL stored procedures.  There have probably been more defects found in the implementation of the algorithm than anywhere else in the system.  However, that state of affairs is significantly less worrisome than it would otherwise be, because there is an effective procedure for investigating the defects, testing the remedies, and guarding against regressions.  By virtue of there being a known and effective procedure for handling the problems, the time to fix each one can be accurately estimated, and that time is generally brief even though the code is complex.  Any reported defect can be described in an XML format that captures both the input that reportedly resulted in the defect and the nature of the error itself.  The XML is used as input to a test driver, which proceeds up to the point at which the error is expected to occur.  At that point, it alerts the developer, providing a numeric value that can be used as input to a second driver that executes the stored procedures and displays the state of the calculation at every point.  Crucially, all of this is reliably idempotent, so the developer can repeat the process as many times as may be necessary to complete the investigation and validate the fix. 
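
The team’s actual XML schema and drivers are not reproduced here.  The following is a minimal sketch, in Python rather than the project’s own stack, of the shape of the first driver only: a defect description that captures the input and the expected failure point, and a replay routine that drives the system up to that point.  The element names, the step names, and the run_step hook are all invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical defect description; the real schema is not shown in the essay.
DEFECT_XML = """
<defect id="D-1042">
  <input channel="http">
    <request>POST /orders</request>
    <body>quantity=-3</body>
  </input>
  <error step="validate-quantity">Negative quantity accepted</error>
</defect>
"""

def load_defect(xml_text):
    """Parse a defect description into the input to replay and the expected failure point."""
    root = ET.fromstring(xml_text)
    inp = root.find("input")
    err = root.find("error")
    return {
        "channel": inp.get("channel"),
        "request": inp.findtext("request"),
        "body": inp.findtext("body"),
        "failing_step": err.get("step"),
        "message": err.text,
    }

def replay(defect, run_step):
    """Drive the system step by step up to the point where the error is expected.

    `run_step` stands in for the real engine.  Because the driver reads only
    the defect description, running it twice gives the same result, which is
    the idempotence the investigation depends on.
    """
    steps = ["parse-input", "validate-quantity", "apply-rules"]
    for step in steps:
        if step == defect["failing_step"]:
            return step  # hand control to the developer at the point of failure
        run_step(step, defect)
    return None
```

The essential property is that the whole investigation is driven from a saved document, not from whatever happened to be in memory when the bug was first seen.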

 

Without this effective procedure for maintaining a critical and complex part of our server, the release schedule would have been jeopardized.  Yet, in preparation for developing this component, what was prepared and scrutinized were detailed designs of the algorithm and the implementation, not the process for maintaining it.  However, those original designs of the algorithm are no longer accurate models of the actual code, while the maintenance procedure has proven to be very important. 

 

The lesson to learn from this experience is plain.  Do not leave the existence of a sound maintenance procedure to chance. 

 

 

In Planning Changes to Software, Take the Effects on the Processes for Maintaining the Software into Account

 

Our server accepts queries in the form of XPath expressions.  Transact-SQL queries are generated from the XPath expressions, and the Transact-SQL queries are executed to retrieve the data matching the query from our data store. 

 

An XPath parser produces an object model representation of the inbound query, and a SQL generator yields the executable Transact-SQL version.  In the process of figuring out the optimal schema for our data store and the best way of structuring the complicated Transact-SQL queries, we interposed an XML representation of the query between the XPath parser and the SQL generator.  So, at that point, the XPath parser produced the object model representation of the query, which was then represented as XML and passed to the SQL generator.  That arrangement was useful because a library of XML documents representing various queries could be created and then used as input for testing various versions of the SQL generator in isolation.  Several iterations of the SQL generator were being produced each week at that point, as we evaluated different schemas and various query strategies. 
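
A toy sketch of that arrangement, in Python rather than the project’s actual implementation, might look like the following.  The object model, the XML schema, and the one query shape the parser handles are all invented for illustration.

```python
import xml.etree.ElementTree as ET

def parse_xpath(xpath):
    """Toy parser: handles only /element[@attr='value'], as an illustration
    of producing an object-model representation of the query."""
    element, _, predicate = xpath.strip("/").partition("[")
    attr, _, value = predicate.rstrip("]").lstrip("@").partition("=")
    return {"element": element, "attr": attr, "value": value.strip("'")}

def to_xml(query):
    """The intermediate XML representation, interposed between parser and generator."""
    root = ET.Element("query", element=query["element"])
    ET.SubElement(root, "filter", attr=query["attr"], value=query["value"])
    return ET.tostring(root, encoding="unicode")

def generate_sql(query_xml):
    """SQL generator that consumes only the XML, so it can be run and tested
    in isolation from the parser and the rest of the server."""
    root = ET.fromstring(query_xml)
    f = root.find("filter")
    return (f"SELECT * FROM [{root.get('element')}] "
            f"WHERE [{f.get('attr')}] = '{f.get('value')}'")
```

The point of the XML in the middle is that generate_sql can be exercised against a library of saved XML documents, with no parser, client, or server involved.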

 

Once we settled on a schema and a way of formulating our SQL queries that yielded satisfactory performance for all of our diverse query test cases, we considered whether to remove the step of producing an XML representation of a query before generating the Transact-SQL version.  Yielding the Transact-SQL directly from the object model representation would be faster, because it would eliminate a step. 

 

Due to the complexity of our queries, the time to execute a query generally dwarfs the time to translate it from XPath into Transact-SQL, regardless of whether we express the query as XML in between.  On the other hand, having the XML to isolate the SQL generator from the rest of the server had become crucial to our process for maintaining the SQL generator.  Whenever a defect was found in our query feature, it was invariably due to the Transact-SQL not returning the correct results, or not executing quickly enough, or being syntactically erroneous.  We would produce the XML representation of the query that caused the error once, execute the SQL generator by itself using that XML as input, and examine the Transact-SQL that got produced in SQL Server Management Studio.  Once the error in the Transact-SQL was identified and the SQL generator modified to correct it, the generator would be tested in isolation against our library of XML representations of queries, which was now expanded to include the one for the most recent failure case. 
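
A regression pass of that kind is a short loop.  The sketch below assumes a hypothetical directory layout, in which each recorded query (*.xml) sits beside the last verified Transact-SQL output (*.sql); the layout and names are illustrative, not the team’s.

```python
import os

def run_regression_suite(library_dir, generate_sql):
    """Re-run the SQL generator over every recorded query in the library and
    report any whose output no longer matches the verified Transact-SQL."""
    failures = []
    for name in sorted(os.listdir(library_dir)):
        if not name.endswith(".xml"):
            continue
        with open(os.path.join(library_dir, name)) as f:
            query_xml = f.read()
        with open(os.path.join(library_dir, name[:-4] + ".sql")) as f:
            expected_sql = f.read().strip()
        actual_sql = generate_sql(query_xml).strip()
        if actual_sql != expected_sql:
            failures.append((name, expected_sql, actual_sql))
    return failures
```

When a new defect is found, its XML representation and the corrected SQL are added to the library, so that failure case is guarded against from then on.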

 

Prior to having the XML version of a query as input to the SQL generator, when the SQL generator took the object model representation of the query as input, debugging the SQL generation was decidedly more difficult.  It required installing and executing the service, attaching a debugger to it, sending in the query from a client application, and using the debugger attached to the service to examine the representation of the query in memory and step through the SQL generation code.  Adding to the difficulty and duration of this process was the fact that our standard client does not allow a user to send an XPath query to our server directly.  Instead, it provides a graphical user interface for building a query that the client translates into XPath for transmission to the server.  So, because we didn’t think to maintain a client dedicated to sending arbitrary XPath expressions to our service for the sake of debugging queries, whenever we had to debug a query, one of us had to build a client for transmitting the XPath query to the service by scavenging the code for that purpose from the code for the standard client. 
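
A dedicated debugging client of that kind can be very small.  Here is a sketch in Python, rather than the team’s actual client stack, that posts a raw XPath expression to a service over HTTP; the URL, endpoint, and content type are invented assumptions.

```python
import urllib.request

SERVICE_URL = "http://localhost:8080/query"  # hypothetical endpoint

def build_query_request(xpath, url=SERVICE_URL):
    """Build an HTTP request carrying a raw XPath query, bypassing the GUI client."""
    return urllib.request.Request(
        url,
        data=xpath.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )

def send_query(xpath, url=SERVICE_URL):
    """Send the XPath expression to the service and return the raw response body."""
    with urllib.request.urlopen(build_query_request(xpath, url)) as resp:
        return resp.read().decode("utf-8")
```

The value of keeping such a tool maintained is not the few dozen lines of code, but that a developer investigating a query defect never has to stop and rebuild it.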

 

Thus, the answer to the question about whether we should improve performance by removing the step of producing an XML representation of a query in the process of translating it from XPath into SQL was a firm “no.”   If the question was considered purely in terms of the software, then the answer might well have been “yes,” because the software would have been simplified, perhaps, by the omission of a step, and its performance improved, albeit just a little.  It is when the issue is considered in terms of the process by which the software is maintained that the correct answer is readily apparent.  In this case, it is hard to justify the step of representing the query as XML in terms of the design of the solution, and especially difficult to justify it with respect to the functional requirements.  Yet, it is crucial for the human process of servicing the query generator. 

 

 

Estimate the Cost of Software Engineering Projects from the Costs of the Processes for Maintaining Each Component

 

Brooks maintained that a software development project ends up a year behind schedule one day at a time.  One might add that days are lost one hour at a time. 

 

Our project ended up a year behind schedule.  It is not difficult, in retrospect, to explain why all those additional hours that proved to be necessary were not anticipated at the outset. 

 

The developers spent time designing and building, over and over again, the tools they needed to fix the bugs.  The time to build those tools once, and properly, was an inevitable expense.  However, it was never correctly incorporated into the project’s estimates.  Furthermore, the overall time could have been reduced if the tools were designed explicitly and if they were shared and maintained and not accidentally broken. 

 

The time that ended up being absorbed by the steps involved in reproducing and fixing bugs was not fully taken into account in the estimates for several reasons.  One is that software developers tend to estimate based on attributes of the software they are to develop, such as lines of code and function points, rather than in terms of tasks that human beings perform, even though it is those tasks that take time.  Another reason is that, in the process of reviewing designs in anticipation of starting to code, no one ever asked the developers if they had thought about how they would go about investigating and fixing bugs, validating their fixes, and guarding against regressions.  That is work that will inevitably have to be done on any software development project, but how that work will be done is not planned in advance or accurately budgeted. 

 

Because there is no physics by which the time absorbed by the concrete steps of maintaining software can be made to disappear, the manner in which the duration of software development projects is estimated has to change.  The estimation should be based on estimates of the time a person will take to execute each definite step in an explicitly designed process.  Once the estimation is done on that basis, a number of benefits will result. 

 

First, management will insist on developers deliberately planning how they are going to do their work, in addition to designing their software.  So the process by which the software is to be maintained will no longer be left to chance. 

 

Second, estimation will become more accurate.  It will be more accurate because it will be based on what actually takes time, which is people doing things, rather than being based on lines of code or function points or some other attributes of the software to be built.  None of those attributes can properly be said to have a duration.  The estimation will also be more accurate, because once developers have a reliable process for maintaining a particular component, they will be able to base their estimates on the time that they know the steps of that process will consume.  For example, if a developer has a reliable process for reproducing any defect in a particular component, diagnosing the cause, validating the fix, and excluding regressions, then the developer will be able to accurately estimate the time to fix any defect.  Without such a process, any estimate will be suspect. 
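
As an illustration with entirely invented step durations, an estimate built from a known maintenance process is just a sum over the steps a person will actually perform:

```python
# Hypothetical per-step durations, in hours; these figures are invented for
# illustration and are not measurements from the essay's project.
REPRO_PROCESS_HOURS = {
    "describe defect as XML": 0.5,
    "replay to failure point": 0.25,
    "diagnose in isolation": 2.0,
    "implement fix": 1.5,
    "re-run regression library": 0.5,
}

def estimate_fix_hours(defect_count, process_hours=REPRO_PROCESS_HOURS):
    """Estimate from what actually takes time: a person executing known steps."""
    per_defect = sum(process_hours.values())
    return defect_count * per_defect
```

With these figures, each defect costs 4.75 hours, so eight open defects cost 38 hours, and every term in that estimate corresponds to something a person will do.  Without a reliable process behind the figures, no such sum could be written down.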

 

Third, the wisdom or otherwise of a proposed change will be properly assessed in terms of its impact on the schedule, because its effect on the software maintenance process will be manifest rather than accidental.  A regression will no longer be just a change that results in a new functional defect, but also one that makes it harder for the team to do its work; both of those outcomes can delay a schedule. 

 

 

Conclusion

 

Software engineering is thought to be a risky enterprise because large, unforeseen difficulties are often encountered.  Yet, serious unforeseen difficulties occur in every enterprise.  Moreover, software engineers take the prospect of those difficulties into account in exactly the same way that people in other enterprises do, which is by adding contingency time to their estimates.  What tends to happen, though, is that the projects end up massively behind schedule no matter how much time was allocated for contingencies.  So it really isn’t the unforeseen difficulties that invalidate the estimates.  Rather, the estimates fail because they are supposed to be of the time it will take for a person to complete a task, yet those tasks are seldom identified in advance by a plan of what will actually be done, from which the durations of the tasks might be estimated.  Instead, the estimates are based on ideas the developers have about the software, which are captured in their designs. 

 

Thinking harder about software has not made software engineering a more reliable enterprise.  What might help is focusing on the steps in the process by which people build and maintain software.