Welcome to our blog dedicated to the engineering of Microsoft Windows 7
Hi Jon DeVaan here.
Steven wrote about how we organize the engineering team on Windows which is a very important element of how work is done. Another important part is how we organize the engineering project itself.
I’d like to start with a couple of quick notes. First is that Steven reads and writes about ten times faster than I do, so don’t be too surprised if you see about that distribution of words between the two of us here. (Be assured that between us I am the deep thinker :-). Or maybe I am just jealous.) Second is that we do want to keep sharing the “how we build Windows 7” topics since that gives us a shared context for when we dive into feature discussion as we get closer to the PDC and WinHEC. We want to discuss how we are engineering Windows 7 including the lessons learned from Longhorn/Vista. All of these realities go into our decision making on Windows 7.
OK, on to the tawdry bits.
Steven linked last time to the book Microsoft Secrets, which is an excellent analysis of what I like to call version two of the Microsoft Engineering System. (Version one involved index cards and “floppy net” and you really don’t want to hear about it.) Version two served Microsoft very well for far longer than anyone anticipated, but learning from Windows XP, the truly different security environment that emerged at that time and from Longhorn/Vista, it became clear that it was time for another generational transformation in how we approach engineering our products.
The lessons from XP revolve around the changed security landscape in our industry. You can learn about how we put our learning into action by looking at the Security Development Lifecycle, which is the set of engineering practices recommended by Microsoft to develop more secure software. We use these practices internally to engineer Windows.
The comments on this blog show that the quality of a complete system contains many different attributes, each of varying importance to different people, and that people have a wide range of opinions about Vista’s overall quality. I spend a lot of time on core reliability of the OS, and from studying the telemetry we collect from real users (only if they opt in to the Customer Experience Improvement Program) I know that Vista SP1 is just as reliable as XP overall and more reliable in some important ways. The telemetry guided us on what to address in SP1. I was glad to see commenters point out one of those ways: sleep and resume work better in Vista. I am also excited by the prospect of continuing our efforts (and we are continuing them), using the telemetry to drive Vista to be the most reliable version of Windows ever. To the list of Vista’s qualities I would add that it cut security vulnerabilities by just under half compared to XP. This blog is about Windows 7, but you should know that we are working on Windows 7 with a deep understanding of the performance of Windows Vista in the real world.
In the most important ways, people who have emailed and commented have highlighted opportunities for us to improve the Windows engineering system. Performance, reliability, compatibility, and failing to deliver on new technology promises are popular themes in the comments. One of the best ways we can address these is by better day-to-day management of the engineering of the Windows 7 code base—or the daily build quality. We have taken many concrete steps to improve how we manage the project so that we do much better on this dimension.
I hope you are reading this and going, “Well, duh!” but my experience with software projects of all sizes and in many organizations tells me this is not as obvious or easily attainable as we wish.
Daily Build Quality
Daily quality matters a great deal in a software project because every day you make decisions based on your best understanding of how much work is left. When the average daily build has low quality, it is impossible to know how much work is left, and you make a lot of bad engineering decisions. As the number of contributing engineers increases (because we want to do more), the importance of daily quality rises rapidly because the integration burden increases according to the probability of any single programmer’s error. This problem is more than just not knowing what the number of bugs in the product is. If that were all the trouble caused then at least each developer would have their fate in their own hands. The much more insidious side-effect is when developers lack the confidence to integrate all of the daily changes into their personal work. When this happens there are many bugs, incompatibilities, and other issues that we can’t know because the code changes have never been brought together on any machine.
I’ve prepared a graph to illustrate the phenomenon using a simple formula predicting the build breaks caused by a 1 in 100 error rate on the part of individual programmers over a spectrum of group sizes (blue line). A one percent error rate is good. If one used a typical rate it would be a little worse than that. I’ve included two other lines showing the build break probability if we cut the average individual error rate by half (red line) and by a tenth (green line). You can see that mechanisms that improve the daily quality of each engineer impact the overall daily build quality by quite a large amount.
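The post does not spell out the formula behind the graph. One simple model that fits the description, assuming each programmer's change breaks the build independently with the same probability, is 1 - (1 - p)^n for a group of n programmers. The sketch below uses that assumed model. The function name and the color labels are hypothetical, and it reads "by a tenth" as "to one tenth of the original rate" (0.1%), which is one interpretation.

```python
def build_break_probability(team_size: int, error_rate: float) -> float:
    """Chance that at least one of `team_size` independent changes breaks the build.

    Assumed model (not stated in the post): every programmer contributes one
    change per day, and each change breaks the build with probability
    `error_rate`, independently of the others.
    """
    return 1.0 - (1.0 - error_rate) ** team_size


# Hypothetical labels mirroring the three lines described in the post.
ERROR_RATES = {
    "blue (1 in 100)": 0.01,
    "red (half of that)": 0.005,
    "green (one tenth of that)": 0.001,  # assumption: "by a tenth" means 0.1%
}

if __name__ == "__main__":
    for team_size in (10, 40, 100, 500, 1000):
        cells = ", ".join(
            f"{label}: {build_break_probability(team_size, rate):.1%}"
            for label, rate in ERROR_RATES.items()
        )
        print(f"{team_size:>5} programmers -> {cells}")
```

Under this model the break probability climbs quickly with group size and falls sharply as the individual error rate drops, which is the shape the graph is described as showing.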
For a team the size of Windows, it is quite a feat for the daily builds to be reliable.
Our improvement in Windows 7 leveraged a big improvement in the Vista engineering system, an investment in a common test automation infrastructure across all the feature teams of Windows. (You will see here that there is an inevitable link between the engineering processes themselves and the organization of the team, a link many people don’t recognize.) Using this infrastructure, we can verify the code changes supplied by every feature team before they are merged into the daily build. Inside of the feature team this infrastructure can be used to verify the code changes of all of the programmers every day. You can see in the chart how the average of 40 programmers per feature team balances the build break probability so that inside of a feature team the build breaks relatively infrequently.
For Windows 7 we have largely succeeded at keeping the build at a high level of quality every day. While we have occasional breaks as we integrate the work of all the developers, the automation allows us to find and repair any issues and issue a high quality build virtually every day. I have been using Windows 7 for my daily life since the start of the project with relatively few difficulties. (I know many folks are anxious to join me in using Windows 7 builds every day—hang in there!)
For fun I’ve included a couple pictures from our build lab where builds and verification tests for servers and clients are running 24x7:
Whew! That seems like a wind sprint through a deep topic that I spend a lot of time on, but I hope you found it interesting. I hope this example gives you the idea that we have been very holistic in thinking through new ways of working and improvements to how we engineer Windows. The ultimate test of our thinking will be the quality of the product itself. What is your point of view on this important software engineering issue?
My last comment to this blog was asking for more details of "How do you guys make Windows?"
What did I get? Oh yes.
Looking at the numbers and using it to decide the size of teams is exactly the type of thinking Bill Gates outlined a decade ago in his book, "Business @ the Speed of Thought". It's nice to see smart work not just done, but how it was done.
Now I just have another request...
Nice post....btw I'd like to test win 7, pls finish it soon.
Very interesting post.
Me likes fancy pictures in posts, hoping to see some more of them! Let us see some of your guys work already! :)
It's nice to get some insight into how you work and handle stuff internally. My only request is telling us more about Windows 7 also, as that is what this blog is about, is it not? ;)
Would it be possible to get some insight into what features are actually being considered for Windows 7 at this stage of development, or are we going to be waiting for already planned conferences?
Er... Where's the beef?
How about giving us some real info about Windows 7? All I've seen so far are rationalizations and org structure.
My fellow commenters... this is "Engineering Windows 7" not "Nifty New Features of Windows 7"
A huge portion of Engineering something is planning. A huge portion of planning is understanding what pieces to add and not add, and understanding how all the pieces will fit together to make something coherent.
BTW they really need to add speel check to IE8 :P
big company = big money available for development = many servers
When I look at the last "big" releases (Vista, and now IE8), which need a lot of resources, I wonder: maybe you should use worse hardware for development? (It could force better code optimization too.)
And what about the many questions about separating applications, making the Registry less vulnerable to garbage left behind by uninstalled applications, etc.? Could you at least tell us whether you are working on it? Or are you planning more of a Vista SE?
This kind of information is very interesting. :D I am also happy that, from what I gather after reading this post, the project isn't facing the same kind of problems it did when building Longhorn (what's the primary reason for this? Is it because of the better test automation infrastructure or because Vista is already modular/MS figured out all the interdependency issues?). Hope this will create a much more stable OS at RTM.
P.S. Wish the pictures were a bit larger.
Great post, but I suspect you had graphs like this for Vista and look how poor the quality was in the real world. I understand that a LOT of the problems had to do with bad drivers, old programs and the like. All true but to a user irrelevant! Look at how hibernate just didn't work. Look at how the Intel 945GM driver didn't work. A common function and a mainstream peripheral and they didn't work, yet I bet your graphs were great.
- Get more hardware diversity inside Microsoft.
- Make EVERY person in Microsoft use the OS early, encourage them to be honest and listen to their feedback.
- Make your "private" betas include more folks and especially folks outside of IT and big companies.
- Bad performance, in any and every subsystem, is as serious as a problem that creates a BSOD. If the system is annoyingly sluggish it is to me the same as a BSOD. I'd rather have to reboot once a week but have a blindingly fast system than have one where it takes me a week to get anything done but never crashes. Before the launch of Vista, I bet most would have disagreed with me about this. After using Vista for about a month, I bet few would have disagreed.
In your blog post you mention the lessons learned from Vista. I think it would be great if you could share some of those lessons in a future blog post.
Also, I'm not sure that I completely follow this post. Are you saying that using your common test automation infrastructure you were able to get bugs/developer down to .1%? Or are you simply saying that you're able to get fewer bugs using the common test automation infrastructure and thus have an almost exponentially lower build break rate?
BTW. Thanks for this blog.
mikefarinha1: Using the "version 2" methodology essentially required that engineers be 100% perfect - the old methodology didn't scale beyond a hundred or so engineers.
Mark Lucovsky talked about some of this conundrum in a presentation he did for Usenix a couple of years ago: http://www.usenix.org/events/usenix-win2000/invitedtalks/lucovsky_html/
By structuring the project the way it has been organized, the day-to-day quality of the project is kept extremely high - mistakes tend to be isolated so that the disruption only affects a small number of developers.
Like Jon I've been running daily builds for months now with essentially no disruption - that's because of the incredible amount of engineering (really) work that went into the organization and build system.
The process allows me to know that I can pretty much pick any daily build and install it on my primary development machine without fear of losing my data.
Hm, a few thoughts that came into my mind while reading this post.
How do you handle code merging? Is there some all-knowing source control team, or is every feature team trusted to only do sane things?
What software do you use for source control?
What about debugging? Profiling? I know this isn't the "diving into the dev. life of W7" blog, but what about using really new technologies (C#, .NET 4, virtualization)?
When I think engineering, I think about inventing new things. New structures, new algorithms, new methods. What are the things you had to invent because they didn't exist yet?
Also, as W7 is not just a piece of software but an OS, the kernel team must polish its APIs first, then the Windows Platform team has to embrace the changes, don't they? How do you handle this seemingly never-ending process/flow? (Linux distributions do this by totally separating them, and publishing updates rather frequently.)
Backward compatibility? Any chance that you'll have the courage (as in management approval) to cut all the nasty tentacles reaching from the past?
I hope they'll be able to organize the implementation of support for animated GIFs in the picture viewer again.
Windows Picture and Fax Viewer was close to perfect, and Vista's taking away support for animated GIFs was a real shot to us artists : (
If you are after true overall system stability, you and the team might want to spend some time with ATI to find out why their drivers regularly crash under Vista.
It is great that Vista can restart the video drivers (most of the time) unlike XP, but I think this might actually be counter-productive - ATI might be more proactive if their drivers hard crashed systems every time.
So ya, nice that you are working hard in house, but you should spend time with a few of the core hardware vendors as well. True, you can't spend time with them all, but there are really only 2 video card makers of note nowadays - ATI and NVIDIA, and they definitely seem to need the help (for those who don't know what I am talking about, search for the words: crash atikmdag)