One thing that is becoming far more commonplace across all of our “screens” is the idea of lightweight notifications. Originally, Windows Gadgets were to offer this type of functionality: a quick heads-up display for some critical information (news, weather, sports scores, or line-of-business events are a few examples). However, the startup time and model of Gadgets are not compatible with reducing overall power consumption (something that is important on desktops and laptops alike) or with delivering the full-screen platform for developers. In addition, the Start screen of Windows 8 provides a much larger surface for these notifications, as well as a user-in-control interface for managing the updates (including their use of networking resources). In a modern experience where more and more information is available via push and in structured snippets, this provides a unique opportunity for developers and end-users. In this post, Ryan Haveson writes about the development of Metro style live tiles and how the architecture scales to large numbers of tiles while also reducing overall power consumption and system load.
–Steven Sinofsky

We all know that performance and battery life are critically important for a successful release of Windows, and your comments continue to emphasize these attributes. @KISSmakesmeSMILE summed it up well by writing:

“…try matching or better yet, surpassing … [competitor’s] battery runtime achievements on light/low load use.”

At the same time, we know that all modern environments (from PCs to TVs to phones) have some form of gadget, widget, or plug-in model that enables at-a-glance information consumption. Watching TV news, sports, or weather shows a structured screen of information with many sources coming together in real time. People expect to be able to quickly check their stocks, weather, email count, next appointment, line-of-business status, or even social networking status in a matter of seconds before getting right back to whatever else they were doing. In many ways, one could argue the PC has some catching up to do in this area compared to other devices. As we set out to design our notifications infrastructure, our challenge was how to make the PC feel alive with activity while remaining extremely efficient with respect to power and bandwidth usage. @AndyCadley’s words express the goal well:

“Treat all your "Metro" apps as if they are always running (but at zero impact to battery/performance)”

The Start screen also makes this efficient from a user model perspective by giving you a full-screen heads-up display without interfering with your desktop or Metro style apps while you are focused on those. Beyond efficiency, we wanted to make sure that you could install as many notifying apps as you want, without having to worry about the impact on performance or battery life.

One thing we have noticed as we are using Windows 8 internally is that the ability to use the Start screen as a unified and highly readable heads-up display for line-of-business applications has become a productivity enhancer. We are seeing a lot of interest in apps that are primarily about notifications. With the scalability of our new push notifications platform, Windows 8 can deliver this capability with minimal system impact, which is a big improvement over the multitude of mechanisms that exist in Windows today. It is not hard to see a scenario, especially early on, where even the most hardcore desktop-only person will find a lot of value in the Start screen as a centralized and well-presented (and controlled) notification area that is just a keystroke away.

Goals of the notification platform

Allowing hundreds of app tiles to be alive with activity while making sure that we don’t degrade performance can seem like a pair of contradictory goals. After all, “activity,” by definition, consumes resources: getting a notification from the cloud uses the network, rendering the notification on a tile uses GPU/CPU resources, and so on. To get the design right, we knew we had to stay focused on the goals we started out with:

  • Allow hundreds of live tiles without degrading performance
  • Go beyond balloons, badges and text, with beautiful images
  • Make it easy for developers so they can just “fire and forget”
  • Achieve real-time delivery so delivering “instant messages” is instant

Based on these goals, the first fundamental architectural decision we made was that the platform would be data-driven; that is, no app code would run in the background to power the Start screen.

If you think about the anatomy of a notification delivery system, it involves several pieces: logic for when to connect, authentication, local caching, rendering, error handling, back-off algorithms, throttling, etc. In addition, the system has to deal with service-side issues such as knowing whether or not you are connected, so it can cache undelivered content and handle complex retry scenarios. Can you imagine if every single app with a live tile had its own version of all that client/server code? Not only would you have different bugs in each implementation, but you would also have duplicate copies of essentially the same code loaded in memory for each app, constantly being paged to and from disk. This would be really inefficient because all of your apps would be running all the time just to keep the Start screen alive. Even on a machine with lots of memory, system performance would eventually slow to a crawl.
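To make one item on that list concrete, here is a minimal sketch of a capped exponential back-off schedule, the kind of retry logic each app would otherwise have to carry on its own. It is written in Python for illustration only; the function name, base delay, and cap are assumptions, not the code or values WNS actually uses.

```python
def backoff_delays(base_seconds: float, cap_seconds: float, attempts: int) -> list[float]:
    """Return the wait before each retry: base, 2*base, 4*base, ... capped at cap_seconds.

    Illustrative only: the doubling factor and cap are assumed, not a real WNS policy.
    """
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        delays.append(min(delay, cap_seconds))
        delay *= 2  # double the wait after every failed attempt
    return delays


# Example: retry up to 8 times, starting at 1 second, never waiting more than 60 seconds.
print(backoff_delays(1, 60, 8))
```

Multiply even this small piece of logic (plus connection management, caching, and throttling) by every app with a live tile, and the case for one shared, system-level implementation becomes clear.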

If you read Bill Karagounis’s post on how we reduced the memory footprint in Windows 8, you know that performance degrades as you increase the number of processes, DLLs, services, etc. that are running. If each live tile were running its own code, we would not have been able to achieve our first goal of allowing hundreds of live tiles without degrading performance.

Our solution was to build a data-driven model. This means that a developer can express their tile using a set of predefined properties and templates, in this case, using an XML schema. The XML tile data is then sent to the Windows Push Notification Service (WNS) via a simple HTTP POST and then we take care of the rest. All the code for connecting, retrying, authentication, caching, rendering, error handling, etc. is done in a uniform and power-efficient way.
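As a rough illustration of how small the app service's side of this can be, the Python sketch below prepares (but does not send) an HTTP POST carrying tile XML to WNS. The channel URI and access token are hypothetical placeholders, and the `X-WNS-Type` and `Authorization: Bearer` headers follow the request format WNS documents for released versions of Windows; treat them as assumptions to verify against current documentation, not details confirmed by this post.

```python
import urllib.request

# Hypothetical values: a real channel URI is obtained by the app on the client,
# and a real access token comes from authenticating the app service with WNS.
CHANNEL_URI = "https://wns-channel.example/?token=HYPOTHETICAL"
ACCESS_TOKEN = "HYPOTHETICAL_ACCESS_TOKEN"

TILE_XML = (
    '<tile><visual lang="en-US">'
    '<binding template="TileWideImageAndText">'
    '<image id="1" src="http://www.fabrikam.com/kickflip.png"/>'
    '<text id="1">First ever surfboard kickflip recorded in Santa Cruz</text>'
    "</binding></visual></tile>"
)


def build_tile_push(channel_uri: str, token: str, tile_xml: str) -> urllib.request.Request:
    """Prepare an HTTP POST carrying tile XML; nothing is sent until urlopen() is called."""
    return urllib.request.Request(
        channel_uri,
        data=tile_xml.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "text/xml",
            "X-WNS-Type": "wns/tile",  # assumed header marking the payload as a tile update
            "Authorization": f"Bearer {token}",
        },
    )


request = build_tile_push(CHANNEL_URI, ACCESS_TOKEN, TILE_XML)
# urllib.request.urlopen(request) would perform the actual POST.
```

The point is the shape of the work: one POST per update, with connecting, retrying, caching, and rendering left to Windows and WNS.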

Here is an example of one of the many tile templates that developers can use for their Windows 8 apps. This one consists of a text field and a single image, but there are many other templates to choose from.

Image of a surfer, with RSS feed icon, and text "First ever surfboard kickflip recorded in Santa Cruz"
Figure 1: Example template (TileWideImageAndText)

Here is the corresponding XML code that describes the above tile:

<?xml version="1.0" encoding="utf-8"?>
<tile>
  <visual lang="en-US">
    <binding template="TileWideImageAndText">
      <image id="1" src="http://www.fabrikam.com/kickflip.png"/>
      <text id="1">First ever surfboard kickflip recorded in Santa Cruz</text>
    </binding>
  </visual>
</tile>
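Because the tile is plain data, an app service can generate it with any XML library. Here is a minimal Python sketch that uses the standard `xml.etree.ElementTree` module to produce the same structure as the XML above (without the `<?xml ...?>` declaration, which `ET.tostring` omits when `encoding="unicode"`). The helper name `wide_image_and_text_tile` is hypothetical; the element and attribute names are copied from the example.

```python
import xml.etree.ElementTree as ET


def wide_image_and_text_tile(image_url: str, caption: str, lang: str = "en-US") -> str:
    """Build TileWideImageAndText XML with the element layout shown in the example."""
    tile = ET.Element("tile")
    visual = ET.SubElement(tile, "visual", lang=lang)
    binding = ET.SubElement(visual, "binding", template="TileWideImageAndText")
    ET.SubElement(binding, "image", id="1", src=image_url)
    text = ET.SubElement(binding, "text", id="1")
    text.text = caption  # ElementTree escapes characters such as & and < automatically
    return ET.tostring(tile, encoding="unicode")


xml_payload = wide_image_and_text_tile(
    "http://www.fabrikam.com/kickflip.png",
    "First ever surfboard kickflip recorded in Santa Cruz",
)
print(xml_payload)
```

Generating the XML with a library rather than by string concatenation keeps captions containing characters like `&` from producing malformed payloads.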

The decision to use a data-driven model allowed us to achieve the first two goals (performance and a high-fidelity experience), but we still had to figure out how to achieve real-time delivery and let developers simply fire and forget.

There are two high-level design patterns for client/server content delivery: polling and push. Polling means that the client checks with the service on a regular basis (for example, every 90 minutes) to see if there is new content. Push means that when there is new content, the service sends the data down to the client directly.

The only way to support instant notifications with a polling model would be to poll at a very high frequency (say, every 5 seconds), so that if a new message arrived, you’d see it almost instantly. But doing so would kill our performance goals: with a 5-second poll interval, the network radio would never be idle, battery life would be horrible, and desktop machines would always be powered up. It would be a little like talking on your cell phone all day long; your phone’s battery wouldn’t last long. On top of that, it would be extremely wasteful to check the server every 5 seconds for content, since most of the time there would be nothing new. Historically, system tray notifications and the desktop Gadgets introduced in Windows Vista have been implemented using polling. But with any practical polling interval, updates still arrive too late for today's instant, real-time services.
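The arithmetic behind that claim is easy to check. This short Python sketch counts the requests a polling client makes per day at a given interval; the figure of 100 independently polling apps is an illustrative assumption, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def polls_per_day(interval_seconds: int) -> int:
    """Number of times a client polls in one day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


fast = polls_per_day(5)        # 17,280 requests per app per day
slow = polls_per_day(90 * 60)  # 16 requests per app per day
apps = 100                     # assumed number of apps, each polling on its own

print(f"5-second polling: {fast * apps:,} requests/day for {apps} apps")
print(f"90-minute polling: {slow * apps:,} requests/day, but updates can be up to 90 minutes late")
```

Polling forces a choice between these two rows. Push avoids the trade-off: the client relies on a single shared connection to WNS that carries data only when there is actually something new.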

Thus, for Windows 8 we architected a push-based service. This was a big decision because it meant we would need to build a platform at a global scale, eventually powering the tiles for hundreds of thousands of apps and over a billion people. But the value was clear: developers would get super-efficient real-time notifications to their customers for free, without having to build or maintain their own persistent connections to the client.

The push notification platform

Let’s take a closer look at the various components of the platform to explain some of the more subtle parts of the design. In the diagram below you see three key entities:

  1. Windows Push Notification Service (WNS): This powers live tiles and toast notifications.
  2. App service: This is the web service that a Metro style app’s developer runs (for example, as part of an existing website), which sends toast notifications and tile updates via WNS. Examples include the back-end service for the Weather app that shipped in the Developer Preview, or a back-end service hosting photos for a social networking app.
  3. Windows 8 client platform: This represents the actual PC and the sub-components in the OS that form the plumbing for the end-to-end experience.

Three graphics shown: App Back-End Service, Windows Push Notification Service (WNS), (which also contains a "Cache"), and Windows 8 Client Platform (which also contains "Tile renderer," "Image Cache" and "WNS Connection" boxes). An arrow marked "1. Push notification" points from App Back-End Service to WNS. Arrow marked "2. Notification" points from WNS to the WNS Connection on the Client Platform. A bi-directional arrow marked "3. Fetch images" runs between App Back-End Service and the Image Cache on the Client Platform.
Figure 2: The push notifications platform

Let’s walk through a typical usage scenario to illustrate how this works. Suppose that the app service is a social networking site that sends a tile update when someone comments on your photo (this could just as easily be a line-of-business app that notifies me when a bug is assigned to me or an expense report needs attention). When there is an update, the app service sends a notification to WNS (Step 1 in the above diagram). From there, WNS pushes the notification down to the client (Step 2). When it is time to show the tile update on the Start screen, the OS fetches the image from the app service based on the URL contained in the notification XML (Step 3). Once the notification and the image are downloaded, Windows renders the live tile based on the template specified in the XML and presents it on the Start screen.

As stated earlier, one of our goals was “fire and forget.” So, to ensure that developers did not have to write complex caching and retry mechanisms for when the PC is not connected (e.g. if it is a laptop that is sleeping), we cache one notification per app in the WNS cloud until the next time that PC is online.
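Here is a minimal Python sketch of the store-and-forward behavior described above: while a PC is offline, the cloud keeps only the most recent notification per app and delivers it when the PC reconnects. The class and method names are hypothetical, and the sketch models only the "one notification per app" rule, not WNS's real storage, expiry, or authentication.

```python
from dataclasses import dataclass, field


@dataclass
class OfflineNotificationCache:
    """Hypothetical model: holds at most one pending notification per (PC, app) pair."""

    pending: dict[tuple[str, str], str] = field(default_factory=dict)
    online: set[str] = field(default_factory=set)
    delivered: list[tuple[str, str, str]] = field(default_factory=list)

    def push(self, pc_id: str, app_id: str, payload: str) -> None:
        if pc_id in self.online:
            self.delivered.append((pc_id, app_id, payload))  # deliver immediately
        else:
            # A newer update replaces any older one for the same app on this PC.
            self.pending[(pc_id, app_id)] = payload

    def connect(self, pc_id: str) -> None:
        self.online.add(pc_id)
        # Flush everything that was waiting for this PC, then drop it from the cache.
        for (pc, app), payload in list(self.pending.items()):
            if pc == pc_id:
                self.delivered.append((pc, app, payload))
                del self.pending[(pc, app)]

    def disconnect(self, pc_id: str) -> None:
        self.online.discard(pc_id)


cache = OfflineNotificationCache()
cache.push("laptop", "weather", "<tile>72F</tile>")
cache.push("laptop", "weather", "<tile>75F</tile>")  # overwrites the 72F update
cache.push("laptop", "stocks", "<tile>CONTOSO +1%</tile>")
cache.connect("laptop")
print(cache.delivered)
```

Because a tile shows only its latest state, keeping one notification per app bounds the cloud-side storage per PC and means a sleeping laptop wakes to a single up-to-date update per tile rather than a backlog.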

As we designed the client platform components, we wanted to make sure that everything was engineered for high performance and low power consumption. One of the key parts of this was separating the notification payload from the image payload. A typical notification XML is less than 1KB of data, but an image can be up to 150KB. Separating these allowed us to save significant network bandwidth in scenarios where the same images appear repeatedly. For example, the image for a tile may be a profile picture of a friend, which your PC can download once and cache locally for reuse. Separating the notification from the image also allowed us to be smart about discarding unused notifications before incurring the expense of downloading the image. If my device’s screen is off and it is sitting in my bedroom while I am at work, there is no point in downloading images for tiles that will just be replaced by subsequent updates before the next time I use the device.
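The client-side savings can be sketched the same way. In this hypothetical Python model, notifications that arrive while the screen is off are coalesced so only the newest per tile survives, and images are fetched only when tiles are about to be shown, through a cache keyed by URL so a repeated image is downloaded once. The `fetch` callable stands in for a real download; none of these names are Windows APIs.

```python
from typing import Callable


class TileClient:
    """Hypothetical model of deferring and de-duplicating tile image downloads."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch                       # stand-in for an HTTP image download
        self.latest: dict[str, str] = {}         # tile id -> image URL from newest notification
        self.image_cache: dict[str, bytes] = {}  # image URL -> downloaded bytes

    def on_notification(self, tile_id: str, image_url: str) -> None:
        # Cheap step: keep only the small notification; an older one for this tile is discarded.
        self.latest[tile_id] = image_url

    def render_start_screen(self) -> dict[str, bytes]:
        # Expensive step runs only now, and only for notifications that survived.
        rendered = {}
        for tile_id, url in self.latest.items():
            if url not in self.image_cache:
                self.image_cache[url] = self.fetch(url)
            rendered[tile_id] = self.image_cache[url]
        return rendered


downloads: list[str] = []


def fake_fetch(url: str) -> bytes:
    downloads.append(url)
    return b"image bytes for " + url.encode()


client = TileClient(fake_fetch)
client.on_notification("social", "http://example.com/alice.png")
client.on_notification("social", "http://example.com/bob.png")  # supersedes alice.png
client.on_notification("photos", "http://example.com/bob.png")  # same image, different tile
client.render_start_screen()
print(downloads)
```

Three notifications arrive, but only one image is downloaded: `alice.png` is discarded before it is ever fetched, and `bob.png` is shared by both tiles through the cache.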

The authentication model

Because live tiles and notifications represent a key part of the app experience, it is important that the communications channel is authenticated and secure, all the way from the app service to the tile on your Start screen. It would be pretty bad if an app or a rogue web service could just update any tile on your machine. For that reason, we use an anonymous authentication mechanism that uniquely identifies the connection between the PC and WNS. Apps and app services also authenticate when communicating with WNS. Authenticating both connections to WNS helps to protect against abuse of live tile updates, such as spoofing attacks. The authentication mechanism used by WNS explicitly ties the application and service together in a way that keeps other applications (or nefarious individuals) from sending content to a tile that they do not own. And of course, all communication takes place over a secure channel.
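The property that matters here is ownership: a notification is accepted only if the authenticated sender owns the channel it targets. The Python sketch below models just that rule. It is hypothetical throughout and does not reflect how WNS actually issues credentials or channels; it only shows why a service authenticated as one app cannot update another app's tile.

```python
import secrets


class ChannelRegistry:
    """Hypothetical model: each channel is bound to exactly one app when it is created."""

    def __init__(self) -> None:
        self._channel_owner: dict[str, str] = {}   # channel id -> owning app id
        self._service_tokens: dict[str, str] = {}  # access token -> authenticated app id

    def register_app_service(self, app_id: str) -> str:
        token = secrets.token_hex(16)  # stand-in for a credential issued after authentication
        self._service_tokens[token] = app_id
        return token

    def open_channel(self, app_id: str) -> str:
        channel_id = secrets.token_hex(8)
        self._channel_owner[channel_id] = app_id
        return channel_id

    def can_send(self, token: str, channel_id: str) -> bool:
        sender = self._service_tokens.get(token)
        return sender is not None and self._channel_owner.get(channel_id) == sender

    def accept(self, token: str, channel_id: str, payload: str) -> str:
        if not self.can_send(token, channel_id):
            raise PermissionError("sender is not authenticated as this channel's owner")
        return payload  # a real service would queue this for delivery


registry = ChannelRegistry()
weather_token = registry.register_app_service("weather")
rogue_token = registry.register_app_service("rogue")
weather_channel = registry.open_channel("weather")

registry.accept(weather_token, weather_channel, "<tile>Sunny</tile>")  # accepted
try:
    registry.accept(rogue_token, weather_channel, "<tile>Spoofed</tile>")
except PermissionError as err:
    print("rejected:", err)
```

Both checks are needed: an unknown token fails authentication, and a valid token for the wrong app fails the ownership test.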

All of this works regardless of whether or not you sign in to Windows using a Windows Live ID. Of course, as Katie Frigon talked about in her post on signing in with a Windows Live ID, Windows 8 is best when you have a connected account as it enables a lot of enhanced experiences such as app cloud storage, roaming Windows and app settings, and single sign-on to multiple apps.  Because the push notification platform uses an anonymous authentication mechanism, even if you do sign in with a Windows Live ID, the developer of the app can’t use the notification pipeline to discover your Windows Live ID, system info, or location.

Building the service to scale

Earlier in this post we mentioned that we had to engineer the platform to support an incredibly large number of users and apps. To give you an idea of this scale, the graph below shows the number of notifications that apps are sending to Windows 8 per day. As of a couple of weeks ago, we were already sending almost 90 million tile updates per day, and we are not even at beta yet!
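To put a daily figure like that in service terms, it helps to convert it to an average rate. Here is a back-of-the-envelope Python calculation using the approximate 90 million/day figure from the text; real traffic varies over the course of a day, so peak rates would be higher than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tile_updates_per_day = 90_000_000  # "almost 90 million" per day, rounded up for illustration
average_per_second = tile_updates_per_day / SECONDS_PER_DAY

print(f"~{average_per_second:,.0f} tile updates per second on average")  # roughly 1,042
```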

Graph shows notifications at 0 on 9/12/2011, spiking to about 64 million on 9/16/2011, dropping back to 36 million on 9/18, and gradually climbing to the 80 to 85 million range in early October.
Figure 3: Notifications per day sent to Windows 8 Developer Preview build

The Stocks app is one of the popular test drive apps from the Developer Preview build. The following graph shows the total number of live tiles registered with this app in the first month after the Developer Preview build was released.

Total live tiles for Stocks app
Figure 4: Live tiles registered to the Developer Preview Stocks app

When the Developer Preview was released, we started watching traffic coming through the data centers, carefully monitoring our scale-out. Here is a visualization of the actual geographic distribution of notifications in the first few days after the Developer Preview was released at //build. Note that the data represents units per square mile and was fitted to a logarithmic scale to account for a wide range of density values.

The design of WNS is based on the Windows Live Messenger service architecture, and in fact, the service part of the notifications platform was built by the same team. There are not many teams in the world with the expertise and knowledge to be able to build a globally scalable service that can ramp up to such large numbers so quickly. Here are a few statistics to give you an idea of the scale of the Windows Live Messenger service today:

  • 300M monthly active users
  • 630M daily logins
  • 10B daily notifications
  • Over 40M peak SOC (simultaneous online connections)
  • Over 3000 machines routing messages around the world

Transparency into tile resource usage via Task Manager

We were so passionate about the performance of the notifications platform that we added metrics to the new Task Manager so you can keep track of how much bandwidth the tile platform consumes for each of your applications. In general, resource usage for tiles should be relatively low. For those of you running the Developer Preview build, go to the App history tab in Task Manager and look at the “Tiles” column to see how much bandwidth each of your live tiles has consumed over the last 30 days.

Heat map of usage history of Metro style apps from 9/17/2011 to 10/17/2011. The "News" app shows 71.9 MB used for Network, 57.2 MB for Metered Network, but only 0.1 MB for Tiles. There are 18 apps listed, and all show either 0 or 0.1 MB usage in the "Tiles" column.
Figure 5: Resource usage of live tiles shown in Task Manager app history

Summary

In Windows 8, we set out to design a notifications platform that would provide at-a-glance information without the performance and battery life concerns that face traditional plug-in and gadget-based models. To that end, every design decision we made was viewed through the lens of performance and battery life efficiency. To make it easy for app developers to participate, we built the Windows Push Notification Service so they could create live tiles without having to write complicated network connectivity code. And because WNS uses standard web technologies, such as HTTP POST, it’s easy for developers to integrate notifications into their existing web services.

The result is a notifications platform that delivers at-a-glance information while allowing you to install as many apps as you want without worrying about the impact on performance or battery life.

--Ryan Haveson