Welcome to our blog dedicated to the engineering of Microsoft Windows 7
We’re busy going through tons of telemetry from the many people that have downloaded and installed the Windows 7 beta around the world. We’re super excited to see the excitement around kicking the tires. Since most folks on the beta are well-versed in the hardware they use and very tuned into the choices they make, we’ve received a few questions about the Windows Experience Index (WEI) in Windows 7 and how that has been changed and improved in Windows 7 to take into account new hardware available for each of the major classes in the metric. In this post Michael Fortin returns to dive into the engineering details of the WEI.
The WEI was introduced in Windows Vista to provide one means across PCs to measure the relative performance of key hardware components. Like any index or benchmark, it is best used as a relative measure and should not be used to compare one measure to another. Unlike many other measures, the WEI merely measures the relative capability of components. The WEI only runs for a short time and does not measure the interactions of components under a software load, but rather characteristics of your hardware. As such it does not (and cannot) measure how a system will perform under your own usage scenarios. Thus the WEI does not measure performance of a system, but merely the relative hardware capabilities when running Windows 7.
We do want to caution folks in trying to generalize an “absolute” WEI as necessary for a given individual. We each have different tolerances or more importantly expectations for how a PC should perform and the same WEI might mean very different things to different individuals. To personalize this, I do about 90% of my work on a PC with a WEI of 2.0, primarily driven by the relatively low score for the gaming graphics component on my very low cost laptop. I run Outlook (with ~2GB of email), Internet Explorer (with a dozen tabs), Excel (with long lists of people on the development team), PowerPoint, Messenger (with video), and often I am running one of several LOB applications written in .NET. I feel with this type of workload and a PC with Windows 7 and that WEI my own brain and fingers continue to be my “bottleneck”. At the other end of the spectrum is my holiday gift machine which is a 25” all-in-one with a WEI of 5.1 (though still limited by gaming graphics, with subscores of 7.2, 7.2, 6.2, 5.1, 5.9). This machine runs Windows 7 64-bit and I definitely don’t keep it very busy even though I run MediaCenter in a window all the time, have a bunch of desktop gadgets, and run the PC as our print server (I use about 25% of available RAM and the CPU almost never gets above 10%).
The overall Windows Experience Index (WEI) is defined to be the lowest of the five top-level WEI subscores, where each subscore is computed using a set of rules and a suite of system assessment tests. The five areas scored in Windows 7 are the same as they were in Vista: processor, memory (RAM), graphics, gaming graphics, and primary hard disk.
Though the scoring areas are the same, the ranges have changed. In Vista, the WEI scores ranged from 1.0 to 5.9. In Windows 7, the range has been extended upward to 7.9. The scoring rules for devices have also changed from Vista to reflect experience and feedback comparing closely rated devices with differing quality of actual use (i.e. to make the rating more indicative of actual use.) We know during the beta some folks have noticed that the score changed (relative to Vista) for one or more components in their system and this tuning, which we will describe here, is responsible for the change.
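The definition above (the overall WEI is the lowest of the five subscores, each between 1.0 and 7.9 in Windows 7) can be sketched in a few lines of Python. This is only an illustration: the function name and dictionary keys are made up for the example and are not part of any Windows API. The sample values are the subscores quoted earlier for the all-in-one machine; which value belongs to which component is an assumption, apart from gaming graphics being the limiting 5.1.

```python
# Score range for Windows 7 as described in the post.
WIN7_MIN_SCORE = 1.0
WIN7_MAX_SCORE = 7.9


def overall_wei(subscores):
    """Return the overall WEI: the lowest of the five subscores.

    `subscores` is a hypothetical mapping of component name -> subscore.
    """
    if len(subscores) != 5:
        raise ValueError("expected exactly five subscores")
    for name, score in subscores.items():
        if not WIN7_MIN_SCORE <= score <= WIN7_MAX_SCORE:
            raise ValueError(f"{name} score {score} is outside the Windows 7 range")
    return min(subscores.values())


# Subscores from the all-in-one example; the component mapping is assumed.
example = {
    "processor": 7.2,
    "memory": 7.2,
    "graphics": 6.2,
    "gaming_graphics": 5.1,
    "primary_hard_disk": 5.9,
}
print(overall_wei(example))  # 5.1
```

Because the overall number is a minimum rather than an average, a single weak component (here gaming graphics) sets the headline score no matter how strong the others are.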
For a given score range, we hope our customers will be able to utilize some general guidelines to help understand the experiences a particular PC can be expected to deliver well, relatively speaking. These Vista-era general guidelines for systems in the 1.0, 2.0, 3.0, 4.0 and 5.0 ranges still apply to Windows 7. But, as noted above, Windows 7 has added levels 6.0 and 7.0; meaning 7.9 is the maximum score possible. These new levels were designed to capture the rather substantial improvements we are seeing in key technologies as they enter the mainstream, such as solid state disks, multi-core processors, and higher end graphics adapters. Additionally, the amount of memory in a system is a determining factor.
For these new levels, we’re working to add guidelines for each level. As an example for gaming users, we expect systems with gaming graphics scores in the 6.0 to 6.9 range to support DX10 graphics and deliver good frame rates at typical screen resolutions (like 40-50 frames per second at 1280x1024). In the range of 7.0 to 7.9, we would expect higher frame rates at even higher screen resolutions. Obviously, the specifics of each game have much to do with this and the WEI scores are also meant to help game developers decide how best to scale their experience on a given system. Graphics is an area where there is both the widest variety of scores readily available in hardware and also the widest breadth of expectations. The extremes at which CAD, HD video, photography, and gamers push graphics compared to the average business user or a consumer (doing many of these same things as an avocation rather than vocation) is significant.
Of course, adding new levels doesn’t explain why a Vista system or component that used to score 4.0 or higher is now obtaining a score of 2.9. In most cases, large score drops will be due to the addition of some new disk tests in Windows 7 as that is where we’ve seen both interesting real world learning and substantial changes in the hardware landscape.
With respect to disk scores, as discussed in our recent post on Windows Performance, we’ve been developing a comprehensive performance feedback loop for quite some time. With that loop, we’ve been able to capture thousands of detailed traces covering periods of time where the computer’s current user indicated an application, or Windows, was experiencing severe responsiveness problems. In analyzing these traces we saw a connection to disk I/O and we often found typical 4KB disk reads to take longer than expected, much, much longer in fact (10x to 30x). Instead of taking 10s of milliseconds to complete, we’d often find sequences where individual disk reads took many hundreds of milliseconds to finish. When sequences of these accumulate, higher level application responsiveness can suffer dramatically.
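To make the pattern concrete, here is a rough sketch of how one might scan a list of read latencies for runs of unusually slow reads and add up the time spent in each run. It is not the analysis used in the feedback loop described above. The function name, the 100 ms threshold, the minimum run length, and the sample trace are all invented for illustration.

```python
def slow_read_runs(latencies_ms, slow_ms=100.0, min_run=3):
    """Find runs of consecutive slow reads.

    Returns a list of (start_index, run_length, total_ms) tuples.
    `slow_ms` and `min_run` are hypothetical thresholds, not values from the post.
    """
    runs = []
    start = None
    for i, latency in enumerate(latencies_ms):
        if latency >= slow_ms:
            if start is None:
                start = i  # a new run of slow reads begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start, sum(latencies_ms[start:i])))
            start = None
    # Handle a run that extends to the end of the trace.
    if start is not None and len(latencies_ms) - start >= min_run:
        runs.append((start, len(latencies_ms) - start, sum(latencies_ms[start:])))
    return runs


# Made-up trace of 4KB read latencies in milliseconds.
trace = [12.0, 15.0, 340.0, 410.0, 295.0, 18.0, 11.0, 520.0, 14.0]
for start, length, total in slow_read_runs(trace):
    print(f"reads {start}..{start + length - 1}: {length} slow reads, {total:.0f} ms stalled")
```

In this made-up trace the three consecutive reads of roughly 300 to 400 ms add up to about a second of stall, which is the kind of accumulation that becomes visible to the person using the PC; the isolated 520 ms read is ignored because it does not form a run.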
With the problem recognized, we synthesized many of the I/O sequences and undertook a large study on many, many disk drives, including solid state drives. While we did find a good number of drives to be excellent, we unfortunately also found many to have significant challenges under this type of load, which based on telemetry is rather common. In particular, we found the first generation of solid state drives to be broadly challenged when confronted with these commonly seen client I/O sequences.
An example problematic sequence consists of a series of sequential and random I/Os intermixed with one or more flushes. During these sequences, many of the random writes complete in unrealistically short periods of time (say 500 microseconds). Very short I/O completion times indicate caching; the actual work of moving the bits to spinning media, or to flash cells, is postponed. After a period of returning success very quickly, a backlog of deferred work is built up. What happens next is different from drive to drive. Some drives continue to consistently respond to reads as expected, no matter the earlier issued and postponed writes/flushes, which yields good performance and no perceived problems for the person using the PC. On some drives, however, reads are often held off for very lengthy periods as the drives apparently attempt to clear their backlog of work and this results in a perceived “blocking” state or almost a “locked system”. To validate this, on some systems, we replaced poor performing disks with known good disks and observed dramatically improved performance. In a few cases, updating the drive’s firmware was sufficient to very noticeably improve responsiveness.
To reflect this real world learning, in the Windows 7 Beta code, we have capped scores for drives which appear to exhibit the problematic behavior (during the scoring) and are using our feedback system to send back information to us to further evaluate these results. Scores of 1.9, 2.0, 2.9 and 3.0 for the system disk are possible because of our current capping rules. Internally, we feel confident in the beta disk assessment and these caps based on the data we have observed so far. Of course, we expect to learn from data coming from the broader beta population and from feedback and conversations we have with drive manufacturers.
For those who obtain low disk scores but are otherwise satisfied with the performance, we aren’t recommending any action (of course, the WEI is not a tool to recommend hardware changes of any kind). It is entirely possible that the sequence of I/Os being issued for your common workload and applications isn’t encountering the issues we are noting. As we’ve said, the WEI is a metric but only you can apply that metric to your computing needs.
Earlier, I made note of the fact that our new levels, 6 and 7, were added to recognize the improved experiences one might have with newer hardware, particularly SSDs, graphics adapters, and multi-core processors. With respect to SSDs, the focus of the newer tests is on random I/O rates and their avoidance of the long latency issues noted above. As a note, the tests don’t specifically check to see if the underlying storage device is an SSD or not. We run them no matter the device type and any device capable of sustaining very high random I/O rates will score well.
For graphics adapters, both DX9 and DX10 assessments can be run now. In Vista, the tests were specific to DX9. To obtain scores in the 6 or 7 ranges, a graphics adapter must obtain very good performance scores, support DX10 and the driver must be a WDDM 1.1 driver (which you might have noticed are being downloaded during the Windows 7 beta). For WDDM 1.0 drivers, only the DX9 assessments will be run, thus capping the overall score at 5.9.
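The eligibility rule in the paragraph above can be sketched as a small function. This is a simplification under stated assumptions: `raw_score` stands in for whatever score the performance assessments alone would produce, and the function and parameter names are hypothetical. The real scoring rules are more involved than a single cap.

```python
DX9_ONLY_CAP = 5.9   # highest score when only the DX9 assessments are run
WIN7_MAX_SCORE = 7.9


def graphics_score(raw_score, supports_dx10, wddm_version):
    """Apply the cap described in the post to a hypothetical raw graphics score.

    `wddm_version` is a (major, minor) tuple, e.g. (1, 1) for WDDM 1.1.
    """
    # Scores in the 6 and 7 ranges require DX10 support and a WDDM 1.1 driver.
    eligible_for_6_plus = supports_dx10 and wddm_version >= (1, 1)
    if eligible_for_6_plus:
        return min(raw_score, WIN7_MAX_SCORE)
    return min(raw_score, DX9_ONLY_CAP)


print(graphics_score(7.3, True, (1, 1)))   # 7.3
print(graphics_score(7.3, True, (1, 0)))   # 5.9
print(graphics_score(4.2, True, (1, 0)))   # 4.2
```

The second call shows why the same adapter can score differently depending on the installed driver: with a WDDM 1.0 driver the result cannot exceed 5.9 regardless of how capable the hardware is.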
For multi-core processors, both single threaded and multi-threaded scenarios are run. With levels 6 and 7, we aim to indicate that these systems will rarely be CPU bound for typical use and are quite suitable for demanding processing tasks and multi-tasking. As examples, we anticipate many quad core processors will be able to score in the high 6 to low 7 ranges, and 8 core systems to be able to approach 7.9. The scoring has taken into account the very latest micro-processors available.
For many key hardware partners, we’ve of course made available additional details on the changes and why they were made. We continue to actively work with them to incorporate appropriate feedback.
regarding the disk score, a lot of us in the beta program are seeing low scores, i know mine was 1.9. then i and a lot of others turned off disk caching in device manager and our scores all improved. mine went from 1.9 to 5.9.
now, you also mention that it tests the primary drive. is this the drive that windows is installed on, or the actual primary on the controller? i know i had to turn off caching for all of the drives before my score improved. i tried the drive the os was on, didn't help, so i just turned caching off for all drives.
i have since turned caching back on.
I hardly think that many of the Disk scores are accurate. I have an older 80GB 7200 RPM IDE hard disk with an 8MB buffer in my work PC, and in my primary home machine (on which I'm dual booting Vista/7) I have a 320 GB 7200 RPM SATA-II drive with a 16MB buffer. The IDE scores a 5.4 and the SATA a 2.9? I hardly believe that's accurate.
In addition, encoding and burning an hour-long avi to DVD-video using Nero Vision generally takes 35-40 minutes in Vista on my hardware, but in Windows 7 it was well over an hour and only about 2/3 done encoding. I could see the video was very "jerky" and hardly smooth in the preview window during the process under Win7. This could very easily be due to real disk issues.
My other test PC also scored a 2.9 with a SATA-II, 7200RPM disk, but it dual boots XP, so no comparison can be made there. I'm looking for another system with an IDE disk to test. I strongly suspect that it will "show" performance better than any of my SATA systems.
If I'm being perfectly honest, I have to say I find it difficult to take the WEI seriously. It's a case where it's oversimplified to the point of being useless.
From what I can gather, there's a lot of detailed measuring that actually goes into determining each subscore. But the problem is that the users, especially us power users who are the only ones who actually care in the first place, are simply not privy to the measurements.
It's hard to take a 6.9 seriously when you have absolutely no basis for how that number was reached. As a front end, simplified method of performance metrics, it's a fine idea. But there absolutely needs to be an entry somewhere where we can see the details, and it would be nice if there was a whitepaper where we could look up how these scores are calculated.
What makes a good disk score? Random writes, sequential writes, free space, access time? Who knows!
What makes a good 3d graphics score? Fill rate, memory size, memory speed, number of stream processors, etc? Who knows!
There's no reason to withhold this information from those who actually care to know it.
I appreciate all the work that has gone into Windows 7 and as much as Vista was bashed by everyone and their mother, Vista is a good operating system.
I like the idea of system scoring. I didn't realize Vista did this until I got my new computer home, it would be nice if it were advertised more and possibly even had a sticker on the box. I doubt that will happen as sellers of low end computers are in for some tough questions.
Something that would be great is if it were scored on a basis of 10. No one scales things on a basis of 5.9 or 7.9. While I understand there are technical reasons behind those numbers, it just makes it harder for lay people to understand. Even if it was rounded off, on the basis of 10, it would be a whole lot easier for someone buying a computer off the shelf to understand.
Very good points by other posters; using WEI as a reliable way to determine Software performance on a given machine is a mistake. Why doesn't my WEI go up when i shutoff memory hogging bloat features like aero?
I strongly disagree with the notion that disk scores should/may change between Vista and 7.
In any analysis exercise such as WEI, the method of observing and measuring should remain constant otherwise the results become meaningless! Imagine if they changed the way unemployment statistics were calculated from one period to the next! This is exactly what is happening with WEI. Any unbiased analyst would suggest the same thing.
A WEI score should be permanent unless the actual hardware is improved e.g. by firmware improvements or new components installed.
I was particularly shocked that a brand new hard drive went from 5.0 in Vista to 3.0 in Windows 7. That's a 40% reduction in score. (Is Windows 7 40% worse than Vista?) A reasonable statistical deviation would be something like plus minus 5%.
Increasing the max score from 5.9 to 7.9 should take account of the new advances in hardware so there should be no need to "downgrade/cap" the score of older hardware. The WEI max should be raised further say to 8.9 so newer hardware scores relatively better than the old without degrading the scores of the old.
If these changes are made permanent in the final release, then the WEI will become an unreliable source of data and I think anyone who uses WEI will be disappointed.
Thanks for the feedback so far. We take it very seriously. Let me see if I can address a couple common points made thus far.
On the matter of write caching, we very much believe it is a mistake that the WinEI score improves dramatically when write caching is disabled. We don't recommend disabling write caching and are working to understand how best to prevent the scores from improving so dramatically simply by disabling the cache. We do know that write caching typically helps best with large sequential reading sequences and that disabling the cache prevents the build up of background work that may later interfere with subsequent reads. In other words, with caching disabled we don't see the very long IOs that result in our capping the score at a low level.
On the matter of transparency, it is indeed our plan to disclose in great detail how the scores are calculated, what the tests attempt to measure, why, and how they map to realistic scenarios and usage patterns.
For the disk assessments, the tests are run on the disk that has Windows 7 installed on it.
On the matter of keeping the tests consistent between operating system releases, we debated this internally and decided it was best to accept some changes to address positive, and sometimes negative, issues that impact realistic scenarios. In support of this decision, I'd like to point out we had a great deal of data in our hands highlighting some common performance issues with disks, including almost all of the early solid state disks as they hit the market. Given the WinEI tests were not sophisticated enough to catch the problem, it seemed wrong for us to continue to highlight the drives as being good, or very good, when in fact they were the root of many responsiveness issues.
I think part of the conclusion stems from the now-established practice of bragging about one's WEI score. Instead of listing the score, you might instead describe it in qualitative terms: "Your system performance indicates that you can expect the following experience: .... To improve the experience, here are some things you might consider changing: ...."
Of course, that removes the possibility of scoring software for purchase.
"Why doesn't my WEI go up when i shutoff memory hogging bloat features like aero?"
For the same reason the horsepower of your car doesn't go up when you rip out the back seats.
"In any analysis exercise such as WEI, the method of observing and measuring should remain constant otherwise the results become meaningless!
Imagine if they changed the way unemployment statistics were calculated from one period to the next!"
Uh... the way unemployment is counted has been updated multiple times over the years.
"I was particularly shocked that a brand new hard drive went from 5.0 in Vista to 3.0 in Windows 7. That's a 40% reduction in score. (Is Windows 7 40% worse than Vista?) A reasonable statistical deviation would be something like plus minus 5%."
If you read the article, you'll see that the way scores are calculated has changed to be a more accurate reflection of the performance of the hardware. Vista calculations didn't take into account pathological behaviours that MS discovered after the testing algorithms were developed.
So basically, your HDD was always a 3.0, Vista just wasn't aware of it.
Maybe it would be sensible to upgrade the WEI in Vista via the next service pack so everything is consistent between the operating systems.
Why is it Windows 7 makes changes to MP3 files? Who died and left you as God or the IP Police?
I can't understand this mentality, and frankly, I don't think it's even legal, is it? How is an operating system granted rights to permanently alter files which I own and have legally made? If you truly respect Digital Rights, then you have no rights to alter files found on my network. They don't belong to you, nor may you automatically modify them.
I'd suggest you take a hard look at this idea before you get yourself into a class-action suit that'll once again, put you at gross odds with the great majority of potential users.
I don't condone this action taken by Windows 7, and I doubt many will.
"Given the WinEI tests were not sophisticated enough to catch the problem, it seemed wrong for us to continue to highlight the drives as being good, or very good, when in fact they were the root of many responsiveness issues."
I agree that it would be better in explaining some performance issues but this will inevitably sacrifice some consistency in comparison between systems on Vista and 7.
However put like that, it is probably more useful to be able to find the bottlenecks in the system with a more accurate score. It's a shame the Vista WinEI tests for hard drives had a 40% uncertainty in their analysis on some drives.
[And regarding unemployment - it was only meant to be a loose analogy i.e. "You can prove anything with statistics"]
@daved1948 - "Why is it Windows 7 makes changes to MP3 files?"
Because the user has the option enabled in WMP to update media files with information from the Internet?