I’m Christian Stockwell, a Program Manager on the IE team focused on browser performance.
Measuring the overall performance of websites and web browsers is important for users comparing the performance characteristics of competing browsers, developers optimizing their websites for download times and responsiveness, browser vendors monitoring the performance implications of code changes, and everyone else who is generally interested in understanding website performance.
I thought it would be interesting to follow up on my previous posts on performance with a discussion around some of the issues impacting browser performance testing and the techniques that you can use to effectively measure browser performance.
A common way to approach browser performance testing is to focus on specific benchmarking suites. Although they can be useful metrics, it can be misleading to rely solely on a small number of targeted benchmarks to understand browser performance as users perceive it—we believe that the most accurate way to measure browser performance must include measuring real-world browsing scenarios. Measuring real sites captures factors that are difficult to isolate in other benchmarks and provides a holistic view of performance. Testing browsers against real-world sites does, however, introduce some key challenges and this post discusses some of the mitigations we’ve adopted to effectively measure IE performance as part of our development process.
Before delving too deeply into this post I wanted to say that effective performance benchmarking is surprisingly difficult. The IE team has invested a great deal of effort building a testing and performance lab in which hundreds of desktop and laptop computers run thousands of individual tests daily against a large set of servers, and our team rarely ends a day at work without a few new ideas for how we can improve the reliability, accuracy, or clarity of our performance data.
Part of the challenge in measuring browser performance is the vast number of different activities for which browsers are used. Every day users browse sites that run the gamut from content-heavy sites like Flickr to minimalist sites like Google. They may encounter interactive AJAX sites like Windows Live Hotmail or purely static HTML sites like Craigslist. Still others may rely on their browsers at work to run mission-critical business applications.
I expect that some of the approaches I discuss here will lend more context to the performance work we’ve done for IE8 and give you some insight into our engineering process. Above all, I hope that this post gives you ideas for improving how some of you measure and think about browser and site performance.
All browsers are inherently dependent on the network and any tests need to reflect that reality to adequately measure performance.
One aspect of the makeup of the internet that can impact browser performance measurement is how content is stored at various levels throughout the servers that comprise the internet. That storage is called caching.
With regard to browser performance measurement, this means that when you visit www.microsoft.com, your browser may request that content from several servers in turn: from your corporate proxy, from a local server, or from a broader set of international servers.
To improve browsing speeds and to distribute the work of serving web pages those servers may choose to temporarily store parts of the page you are navigating to so other users can get them faster. For example, if you get to work first thing in the morning and visit www.msnbc.com to quickly check the news you may request that page first from your corporate proxy server, which would then relay that request to a local server before finally getting the webpage from a server across the country. Once that page has been retrieved, your work’s proxy server or the local server may decide to store some of that content. When your friend Tracy in accounting comes in to work ten minutes later and tries to navigate to www.msnbc.com, she may get the content directly from your work proxy server instead of a server across the country—drastically reducing the time needed to navigate to the site and making Tracy very happy.
In a similar vein, when measuring the performance of several browsers it’s important that we consider the impact of caching. For example, if I were to open ten tabs to ten different websites in one browser, and then open the same ten tabs in a second browser, I could wrongly conclude that the second browser was faster when in fact the difference was due primarily to the content being cached by a nearby server when the first browser requested the pages.
It’s hard to rigorously control how servers may cache content, but one general principle of performance measurement is to never measure anything only once. Unless you are specifically trying to measure the impact of upstream caching you should navigate to the sites you want to measure at least once before you start collecting any performance data. In fact, since proxies can cache content per user agent (browser), you should visit each site you intend to test with every browser you will test.
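As a rough illustration of that warm-up principle, here is a minimal Python sketch. It assumes you already have some way to drive a browser to a page, represented by a hypothetical `load_page` callable that returns once the page is loaded; the function name `measure_with_warmup` and its default values are likewise assumptions made for this example, not a tool used by the IE team.

```python
import time
from typing import Callable, List


def measure_with_warmup(
    load_page: Callable[[], None],
    warmups: int = 1,
    repetitions: int = 7,
) -> List[float]:
    """Run untimed warm-up loads, then return timed load durations in seconds.

    `load_page` is a hypothetical callable that navigates one browser to one
    site and returns when the page is considered loaded.
    """
    # Untimed visits: give upstream proxies and servers a chance to cache
    # the content before any data is collected.
    for _ in range(warmups):
        load_page()

    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        load_page()
        timings.append(time.perf_counter() - start)
    return timings
```

Because proxies can cache per user agent, you would call this once per browser and per site, so that every browser gets its own warm-up visit before its timed runs.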
My summary of caching behaviour is simplified. If you’d like more detailed information many great resources exist that describe the process in greater detail, including the HTTP protocol specification itself. The HTTP protocol spec also makes great nighttime reading and is a conversation starter at any party.
Precisely because there are so many external factors involved in browser performance the number of performance measurements you take can drastically change your conclusions.
I’ve mentioned that a general principle in performance measurement is to never measure anything only once. I’m going to expand that principle to “always measure everything enough times”. Many different schemes exist to determine what “enough times” means—using confidence intervals, standard deviations and other fun applications of statistics.
For a lot of the performance data we collect we often find that adopting a pragmatic approach and avoiding those relatively complex schemes is sufficient. In our lab we find that 7-10 repetitions is usually enough to collect a reliable set of data and identify trends, but you may find that more repetitions are needed if your environment is less controlled.
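If you do want a quick statistical check on whether your repetitions are enough, a sketch along the following lines may help. The function name `summarize` is hypothetical, and the 95% interval uses a simple normal approximation (the 1.96 multiplier), which is an assumption of this example; with only 7-10 samples a Student's t multiplier would give a somewhat wider interval.

```python
import math
import statistics
from typing import Dict, List


def summarize(timings: List[float]) -> Dict[str, object]:
    """Return the sample size, mean, sample standard deviation, and an
    approximate 95% confidence interval for a list of timings (seconds)."""
    n = len(timings)
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)  # sample standard deviation, needs n >= 2
    # Normal approximation; treat the interval as a rough guide only.
    half_width = 1.96 * stdev / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
    }
```

If the interval is wide relative to the difference you are trying to detect between two browsers, that is a hint to collect more repetitions or to look for sources of noise in your environment.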
Once you’ve collected your performance data you will likely want to summarize your results to draw conclusions. Whether you use the arithmetic mean, harmonic mean, geometric mean, or some other method, you should be consistent and fully understand the ramifications of how you are summarizing your data.
For example, let’s look at the following data points collected by testing two browsers navigating to a single webpage:
In this contrived example it’s clear that how you summarize your data can change your interpretation of the data—whereas the arithmetic mean suggests that Browser B is faster than Browser A, both the geometric and harmonic means would lead you to the opposite conclusion.
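To see how this kind of disagreement can arise, here is a small Python sketch using the standard `statistics` module. The numbers are invented purely for illustration and are not the measurements from the example above; `summarize_means` is a hypothetical helper name.

```python
import statistics

# Hypothetical page-load times in seconds (invented for illustration).
browser_a = [1.0, 1.0, 1.0, 1.0, 10.0]  # usually fast, one very slow load
browser_b = [2.5, 2.5, 2.5, 2.5, 2.5]   # consistently moderate


def summarize_means(timings):
    """Summarize one list of timings with three different means."""
    return {
        "arithmetic": statistics.mean(timings),
        "geometric": statistics.geometric_mean(timings),  # Python 3.8+
        "harmonic": statistics.harmonic_mean(timings),
    }


means_a = summarize_means(browser_a)
means_b = summarize_means(browser_b)

# Lower is better: report which browser each summary would call faster.
for name in ("arithmetic", "geometric", "harmonic"):
    faster = "A" if means_a[name] < means_b[name] else "B"
    print(f"{name:>10}: A={means_a[name]:.2f}s  B={means_b[name]:.2f}s  faster: {faster}")
```

With this made-up data the single slow load dominates Browser A's arithmetic mean, so the arithmetic mean favours Browser B, while the geometric and harmonic means, which are less sensitive to one large value, favour Browser A.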
Sharing your network with other users means that—seemingly without rhyme or reason—your web browser may suddenly take much longer to perform the same action.
One benefit of working for a very large company like Microsoft is that the large number of employees makes certain phenomena reliable and measurable. For example, measuring page download times over the course of the day it’s clear that most of Microsoft starts working in earnest between 8am and 9am, and leaves between 5pm and 6pm.
I can tell because most Microsoft employees are accessing the network fairly constantly over the course of the day. Whether we’re browsing MSDN, reading documents on SharePoint, or rigorously testing the latest Xbox games, we’re all competing for bandwidth. That sharing means that if I measure browser performance at 6am I will reliably get more consistent results than if I measure browser performance at 9am, when the entire company is getting to work and starting to email away.
Given the wide variety of networking configurations available in different companies it’s hard to predict the impact of bandwidth competition. To avoid having it distort your results I suggest that you try to collect performance data outside of core business hours if you’re collecting performance data at work.
If you’re collecting performance data at home you can similarly be sharing bandwidth with your family or other people nearby. In those cases you could time your measurements during times when fewer people are likely to be browsing—during business hours, late at night, or very early in the morning.
Sharing resources across applications on your own machine can affect browser performance just as severely as competing for bandwidth.
This is particularly true when multiple applications rely on the same external applications or platforms. For example, some anti-virus products may integrate differently with various browsers—with unknown performance consequences.
Testing two browsers side-by-side can produce the most distorted set of results. For example, on the Windows platform there is a limit of ten outbound socket requests at any one time; additional requests are queued until connection requests succeed or fail. Testing two browsers side-by-side means that you are likely to run into that limit, which could result in one browser maintaining an unfair advantage by virtue of having started microseconds sooner.
I’ve offered up two simple examples and others certainly exist. Without going into far greater detail I think it’s clear that I advise against running multiple applications when trying to measure browser performance.
At a minimum you should take two steps to reduce the chance of interference from other applications:
Beyond interference from shared resources on your machine or on your network, your performance results can also be impacted by the internal behaviour of the servers you are visiting.
One of the overarching principles when taking performance measurements is to try to maintain a common state between your tests. For cache management that meant you should give upstream servers a chance to reach a known state before collecting performance data, whereas for the network it meant trying to conduct your tests in a consistent environment that reduces the impact from external sources.
For an example of the application design characteristics which may impact benchmarking, let’s take the example of an online banking application. For security reasons some banking applications only provide access to account information when appropriate credentials are provided. Assuming the benchmarking test is trying to compare two (or more) browsers at this online banking Web site, it’s important to ensure the application is in a consistent state for each browser. By design, most online banking applications will prevent a user from being logged in to two sessions at the same time – when one is logged in, the other is logged out. Failure to reset the Web application state before starting the test on the second browser could cause the server based application to take extra time to analyze the second request, close the first session and start a new one.
That setup and teardown process can impact benchmarking and is not limited to online banking applications, so you should try to remove it as a factor. More generally, you should understand how your sites behave before using them during performance testing.
In many fields there is the potential that the action of taking a measurement can change the thing that you are trying to measure—that phenomenon is called the Observer Effect.
You can use any of a number of frameworks to simplify the task of measuring specific browsing scenarios. These frameworks are typically aimed at developers or technical users. One example of such a framework is Jiffy.
As with any infrastructure that may directly impact the results you are trying to measure you should carefully assess and minimize the potential for introducing changes to the performance due to the framework you are using for measurement.
As an aside, the IE team uses the Event Tracing for Windows (ETW) logging infrastructure for our internal testing as it provides a highly-scalable logging infrastructure that allows us to minimize the potential for the Observer Effect to distort our results.
Just as with humans, no two machines are exactly alike.
As I mentioned above, within the IE performance lab we have a very large bank of machines that are running performance tests every hour of every day. To maximize our lab’s flexibility, early in IE8 we attempted to create a set of “identical” machines that could be used interchangeably to produce a consistent set of performance data. Those machines bore consecutive serial numbers and were from the same assembly line, and all their component parts were “identical”. Despite those efforts, however, the data we collected on that set of machines has been sufficiently varied that we avoid directly comparing performance results from two different machines.
It should come as no surprise, then, that I suggest that unless you want to study how browser performance varies across different platforms you should test all browsers on a single machine.
The amount of time it takes to start a browser can depend on many factors outside of the control of the browser.
As with caching, measuring the speed of browser startup is susceptible to outside factors—particularly the first time you start the browser. Before the browser can start navigating to websites it needs to load parts of itself into memory—a process that can take some time. The first time you start the browser, it is difficult to know exactly how much may already be loaded into memory. This is particularly true for IE since many of its components are shared with other programs.
To collect more consistent data, open and close each browser at least once before you start testing it. If you have no other applications running, that should give your operating system the opportunity to load the required components into memory and improve the consistency and accuracy of your results. It should also provide a fairer comparison between browsers, especially in light of features like Windows SuperFetch that may otherwise favour your preferred browser.
Websites change constantly. Unfortunately, that also includes the time when you are attempting to test performance.
Within the IE team’s performance lab all website content is cached for the duration of our testing. One impact of that caching is that we can ensure that exactly the same content is delivered to the browser for each repetition of a test. In the real world, however, that is often not the case.
News sites, for example, may update their content as a story breaks. Visiting Facebook or MySpace twice may result in radically different experiences as your friends add new pictures or update their status. On many websites advertisements change continually, ensuring that any two visits to your favorite site are going to be different.
Outside of a lab environment it is hard to control that type of change. Approaches certainly exist, and you can use tools like Fiddler to manipulate the content your browser receives. Unfortunately, those approaches stand a very good chance of affecting any performance results. As a result, the pragmatic solution is to follow the advice I’ve outlined in my point on sample sizes above—and if you notice that a very heavy advertisement is appearing every few times you visit a page, I think it’s fair to repeat that measurement to get a consistent set of results.
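One simple way to spot measurements worth repeating is to compare each timing against the median of the set. The Python sketch below is only an illustration: the function name `indices_to_repeat` and the threshold of twice the median are assumptions chosen for this example, and you should pick a threshold that suits your own data.

```python
import statistics
from typing import List


def indices_to_repeat(timings: List[float], factor: float = 2.0) -> List[int]:
    """Return the positions of timings that exceed `factor` times the median.

    These are candidates for re-measurement, for example because a very
    heavy advertisement happened to load on that visit.
    """
    median = statistics.median(timings)
    return [i for i, t in enumerate(timings) if t > factor * median]
```

For instance, with timings of 1.0, 1.1, 0.9, 1.0 and 3.5 seconds, only the last visit is flagged. Whether you then repeat that visit or keep it depends on what you are trying to measure, so treat the flag as a prompt to look at the page, not as an automatic filter.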
Not only can websites change under you, but site authors may also have written drastically different versions of their website for different browsers.
One tricky spin on the problem of ensuring that websites serve up the same content for each of your tests is the case of sites that serve distinctly different code to different browsers. In most cases you should ignore these differences when measuring browser performance because they are valid representations of what users will experience when visiting those websites.
In some cases, however, websites can offer functionality that differs so widely between browsers that the cross-browser comparison is no longer valid. For example, I was recently investigating an issue where one of our customers was reporting that a website was taking several times longer in IE8 than it was in a competing browser. After some investigation I discovered that the website was using a framework that provided much richer functionality in IE than in the other browser. Fortunately the website was not relying on any of that richer functionality, so they were able to slightly modify how they were using the framework to make their site equally fast across browsers.
In that example the website was not using the extra functionality offered by their framework and they were able to update their site—but in many cases websites offer completely different user experiences depending on the browser. Assessing those websites is largely a case-by-case affair, but I typically consider those sites unsuitable for direct comparisons because their performance reflects the intentions of the site developers as much as the performance of browsers.
Identifying sites that differentiate between browsers is not simple, and in this case web developers generally have an upper hand on reviewers. Web developers should use profilers, debuggers, and other tools at their disposal to identify areas in which their websites may offer drastically different experiences across browsers.
Reviewers and less technical users should avoid measuring cross-browser performance on sites that clearly look and behave differently when you try to use them, since in those cases it is difficult to disentangle browser performance from website design.
Can you define what “a webpage is done loading” means? How about for a complex interactive AJAX site?
One surprisingly intractable issue in performance measurement is defining what “done” really means in terms of navigating to a webpage. The problems involved are compounded as websites grow increasingly complex and asynchronous. Some web developers have used the HTML “onload” event as an indicator of when the browser has finished navigating to a webpage. The definition of that event is, unfortunately, interpreted differently across different browsers.
Within the IE team we use some of our internal logging events to measure page loads across sites. Since that logging is IE-specific it does not, unfortunately, provide an easy way to measure page loading performance across browsers. And, although cross-browser frameworks like Jiffy and Episodes exist that can help site developers define when their scenarios are “Done”, those indicators are not yet widely consumable by users at large.
Beyond specific code-level indicators some people use browser progress indicators to assess when a page is finished downloading—hourglasses, blue donuts, progress bars, text boxes, and other UI conventions. These conventions, however, are not governed by any standards body and browser makers can independently change where and when (and if!) they are displayed.
Faced with those realities, the pragmatic approach I encourage reviewers and users to adopt is to use browser progress indicators while validating those indicators against actual webpage behaviour. For example, when you are testing how quickly a particular web page loads, try to interact with it while it is loading for the first time. If the webpage appears to be loaded and is interactive before the progress indicators complete then you may want to consider ignoring the indicators and using the page appearance for your measurements. Otherwise, the progress indicators may be enough for an initial assessment of how quickly a page is downloading across various browsers. Without validating that the actual page load corresponds closely to the browser indicators it is difficult to understand when they can be trusted for performance measurement.
Running an add-on when you are testing a browser means that you are no longer only testing browser performance.
As I discussed in my April post, add-ons can have a tremendous impact on the performance of a browser. In the data I receive through Microsoft’s data channels it is not uncommon for me to see browsers with dozens of add-ons installed and I suspect that my colleagues at the Mozilla Corporation could say the same of their browser.
Any of those add-ons may be performing arbitrary activity within the browser. Illustrating that impact by way of an anecdote, I’ve noticed that some users with a preferred browser sometimes find any alternative browser faster simply because it comes with a clean slate. For example, a Firefox user with several add-ons installed could move to IE and observe enormous performance improvements while an IE user could migrate to Firefox and observe the same performance benefit. Those results are not contradictory, but rather reflect the significant impact of browser add-ons.
As a result, in our performance lab we test both clean browser installations and installations with the most common add-ons. To disable all add-ons in IE8, click on the “Tools” menu and select “Manage add-ons”. In the Manage Add-ons screen ensure that you’ve chosen to show “All Add-ons”, and disable each listed add-on. Alternatively, if you are comfortable with the command line you can run IE with add-ons disabled with the “iexplore.exe -extoff” command.
Since most browser makers go to great lengths to ensure that add-ons continue to work as expected when upgrading, taking the time to follow these steps is particularly important when evaluating new versions of browsers as any performance improvements may be hidden by a single misbehaving add-on.
I know this post has been quite long, but I hope that by covering a few of the techniques we use when measuring IE performance you will be able to adapt some of them to your particular needs. Understanding how we think about performance testing may also give you a better understanding of our process and our approach to browser performance. Last but not least, I hope that I’ve given you a little more insight into some of the work going on behind the scenes to deliver IE8.
Christian Stockwell
Program Manager