AKA: How I spent last week :).
On Tuesday morning last week, I got an email from "email@example.com":
You've probably already seen this article, but just in case I'd love to hear your response. http://it.slashdot.org/article.pl?sid=07/08/21/1441240 Playing Music Slows Vista Network Performance?
In fact, I'd not seen this until it was pointed out to me. It seemed surprising, so I went to talk to our perf people, and I ran some experiments on my own.
They didn't know what was up, and I was unable to reproduce the failure on any of my systems, so I figured it was a false alarm (we get them regularly). It turns out that at the same time, the networking team had heard about the same problem and they WERE able to reproduce the problem. I also kept on digging and by lunchtime, I'd also generated a clean reproduction of the problem in my office.
At the same time, Adrian Kingsley-Hughes over at ZDNet Blogs picked up the issue and started writing about it.
By Friday, we'd pretty much figured out what was going on and why different groups were seeing different results. It turns out that the issue was highly dependent on your network topology and the amount of data you were pumping through your network adapter. The reason I hadn't been able to reproduce it is that I only have a 100mbit Ethernet adapter in my office: you can get the problem to reproduce on 100mbit networks, but you've really got to work at it to make it visible. Some of the people working on the problem sent a private email to Adrian Kingsley-Hughes on Friday evening reporting the results of our investigation, and Mark Russinovich (a Technical Fellow, and all-around insanely smart guy) wrote up a post explaining what's going on in insane detail, which he posted this morning.
Essentially, the root of the problem is that in Vista, when you're playing multimedia content, the system throttles incoming network packets to prevent them from overwhelming the multimedia rendering path - the system will only process 10,000 network frames per second (this is a hideously simplistic explanation, see Mark's post for the details).
For 100mbit networks, this isn't a problem - it's pretty hard to get a 100mbit network to generate 10,000 frames in a second (you need to have a hefty CPU and send LOTS of tiny packets), but on a gigabit network, it's really easy to hit the limit.
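To put rough numbers on that, here is a back-of-the-envelope sketch of how many frames per second a fully saturated link can carry. It assumes standard Ethernet framing (38 bytes of per-frame overhead on the wire); the function name and the payload sizes are just illustrative choices, not anything taken from Vista or from Mark's post.

```python
# Rough upper bound on how many Ethernet frames per second a link can carry.
# Assumption: standard Ethernet framing, which adds 38 bytes per frame on the
# wire (8 preamble/SFD + 14 header + 4 FCS + 12 inter-frame gap).
ETHERNET_OVERHEAD_BYTES = 38
THROTTLE_FRAMES_PER_SECOND = 10_000  # the limit described in the post


def max_frames_per_second(link_bits_per_second: float, payload_bytes: int) -> float:
    """Frames per second if the link is saturated with equal-size frames."""
    wire_bits = (payload_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bits_per_second / wire_bits


if __name__ == "__main__":
    for name, speed in (("100mbit", 100e6), ("gigabit", 1e9)):
        # Smallest, mid-size, and full-size payloads (illustrative choices).
        for payload in (46, 600, 1500):
            rate = max_frames_per_second(speed, payload)
            verdict = "over" if rate > THROTTLE_FRAMES_PER_SECOND else "under"
            print(f"{name:8s} {payload:5d}-byte payload: "
                  f"{rate:10.0f} frames/s ({verdict} the limit)")
```

Under these assumptions, a 100mbit link saturated with full-size 1500-byte frames tops out at roughly 8,100 frames per second, so it only crosses the limit with smaller packets, while a saturated gigabit link carries about eight times the limit even with full-size frames.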
One of the comments on Adrian's blog came from George Ou (another ZDNet blogger):
""The connection between media playback and networking is not immediately obvious. But as you know, the drivers involved in both activities run at extremely high priority. As a result, the network driver can cause media playback to degrade."
I can't believe we have to put up with this in the era of dual core and quad core computers. Slap the network driver on one CPU core and put the audio playback on another core and problem solved. But even single core CPUs are so fast that this shouldn't ever be a problem even if audio playback gets priority over network-related CPU usage. It's not like network-related CPU consumption uses more than 50% CPU on a modern dual-core processor even when throughput hits 500 mbps. There’s just no excuse for this."
At some level, George is right - machines these days are really fast and they can do a lot. But George is missing one of the critical differences between multimedia processing and other processing.
Multimedia playback is fundamentally different from most of the day-to-day operations that occur on your computer. The core of the problem is that multimedia playback is inherently isochronous. For instance, in Vista, the audio engine runs with a periodicity of 10 milliseconds. That means that every 10 milliseconds, it MUST wake up and process the next set of audio samples, or the user will hear a "pop" or “stutter” in their audio playback. It doesn’t matter how fast your processor is, or how many CPU cores it has, the engine MUST wake up every 10 milliseconds, or you get a “glitch”.
For almost everything else in the system, if the system locked up for even as long as 50 milliseconds, you’d never notice it. But for multimedia content (especially for audio content), you absolutely will notice the problem. The core reason behind it has to do with the physics of sound, but whenever there’s a discontinuity in the audio stream, a high frequency transient is generated. The human ear is quite sensitive to these high frequency transients (they sound like "clicks" or "pops").
Anything that stops the audio engine from getting to run every 10 milliseconds (like a flurry of high priority network interrupts) will be clearly perceptible. So it doesn’t matter how much horsepower your machine has, it’s about how many interrupts have to be processed.
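A toy model makes the point. The sketch below is a deliberate simplification, not how the audio engine actually works: it assumes the engine needs only a negligible slice of time in each 10 millisecond period, and that a period glitches only if the engine is locked out for the whole period. The function name and the example stall lengths are made up for illustration.

```python
# Toy model of an isochronous deadline: the engine must get to run at some
# point in every 10 ms period. A "stall" is a stretch of time during which
# it cannot run at all (for example, a burst of high priority interrupts).
# Assumption: a period is missed only if the stall covers it completely.
PERIOD_MS = 10


def missed_periods(stall_start_ms: int, stall_length_ms: int,
                   period_ms: int = PERIOD_MS) -> int:
    """Count the periods that lie entirely inside [start, start + length)."""
    stall_end_ms = stall_start_ms + stall_length_ms
    # Index of the first period that begins at or after the stall begins.
    first_period = -(-stall_start_ms // period_ms)  # ceiling division
    # Index of the first period that is NOT fully covered by the stall.
    end_period = stall_end_ms // period_ms
    return max(0, end_period - first_period)


if __name__ == "__main__":
    for start, length in ((0, 5), (3, 10), (9, 12), (0, 50)):
        print(f"stall of {length:2d} ms starting at {start} ms: "
              f"{missed_periods(start, length)} missed period(s)")
```

Notice that nothing about CPU speed or core count appears anywhere in the function: in this model the only things that matter are how long the engine is locked out and where the stall lands relative to the period boundaries.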
We had a meeting the other day with the networking people where we demonstrated the magnitude of the problem - it was pretty dramatic, even on a top-of-the-line laptop. On a lower-end machine it's even more dramatic. On some machines, heavy networking can turn video rendering into a slideshow.
Any car buffs will immediately want to shoot me for this analogy, because I’m sure it’s highly inaccurate (I am NOT a car person), but I think it works: You could almost think of this as an engine with a slip in the timing belt – you’re fine when you’re running the engine at low revs, because the slip doesn’t affect things enough to notice. But when you run the engine at high RPM, the slip becomes catastrophic – the engine requires that the timing be totally accurate, but because it isn’t, valves don’t open when they have to and the engine melts down.
Anyway, that's a long-winded discussion. The good news is that the right people are actively working to ensure that a fix is made available for the problem.
Thanks for the response.
I guess I misunderstood. I thought Mark's post was saying MMCSS threads may use up to 80%, leaving 20% for everyone else.
I still hope the solution allows the user to control the tradeoff.
What I wondered was why the network packets were being handled in DPCs (and therefore the scheduler doesn't get a look-in) rather than being moved off to a worker thread or to a thread in the application which created the socket (and yes, I'm aware that the Windows file sharing and HTTP in Windows Server 2003 are implemented as drivers). Obviously there are parts of the TCP/IP protocol suite that don't end up in applications - ICMP Echo processing, for example - which would have to be handled by a shared pool of worker threads.
I suppose that in the case where TCP is receiving a large response from a server and has nothing to send in the other direction, it has to generate ACK packets on a timely basis to keep the window filled.
Is the latency for getting the processing onto a worker thread just too high for this to work?
Mike, the answer's tied up in the dark ages of NT's history, but essentially the issue is that the network stack passes an indication at DPC time to the higher level components (TCP, UDP, then to RDR and SRV or WINSOCK). Those components then get the opportunity to decrypt/decode/interpret the data in the indication data before they post a receive to retrieve the data. TCP also generates acks at that time.
I don't know anything about the decisions that the networking team made, so I don't know about the worker thread thing.
Larry - You said interrupts are serviced by CPU0 on both multi core and true MP machines - Really? I am fairly certain the OS can balance interrupts across CPUs or can choose to pin them down to one CPU. So if Windows is designed not to balance IRQs across CPUs, it is a Windows design limitation and not a hardware limitation. Or am I missing something here?
Linux for example uses irqbalance daemon for balancing interrupts - http://www.irqbalance.org/documentation.php specifically mentions -
"Intel chipsets (and similar chipsets from other vendors) use something akin to a table (it's programmed into a component called IO-APIC) for this, and this "table" maps specific interrupts to specific cores or sets of cores. The standard table in our hardware effectively maps all interrupts to core 1 of the first socket. While this works, it also means that under high utilization (for example, on a really busy network) this core gets to spend a disproportional amount of work on processing the interrupts.
It is the task of the interrupt balancing software to distribute this workload more evenly across the cores: to determine which interrupts should go to which core, and then fill this table for the chipset to use."
Here's a better explanation and how you can work around the issue with jumbo frames.
The point is that the 10K packets per second rate limit is hard-coded for the worst case scenario. It does not account for faster multi-core CPUs. My Core 2 Duo E6400 could have easily been set to 50K packets per second and the same assurances to multimedia would have been guaranteed.
The other problem is that MMCSS doesn’t distinguish between idle, simple music playback, DVD, or HD Video playback since those clearly have different processor requirements.
This is a poor design and Microsoft will need to fix this and the simplest and most reasonable way to address this is with a content and CPU aware dynamic rate limit for networking performance. Making the rate limit per LAN interface and not amongst all the interfaces is probably a good idea too and that’s an obvious bug that needs to be fixed.
OSGuy: I don't know how Linux handles it, I just know that the guys who know this kind of stuff over here tell me that multicore machines handle interrupts on CPU0. Some MP designs handle interrupts on separate processors, but the majority of the inexpensive (i.e. non-server) ones just interrupt CPU0 - it's cheaper :).
I can't speak to how Linux handles this, I'm not a kernel guy.
George: I don't think that anyone is defending the decision to go with 10K packets/second. Certainly during the internal discussion of the problem (and I was on all of the emails) keeping the limit was never one of the suggestions tendered. Btw, I do like your jumbo frames idea, it's a good one.
As Mark (and I) have said, we're going to address this issue.
Larry - CPU0 is the default for handling interrupts, that's how the table is programmed initially - most modern OSes dynamically reprogram the APIC to distribute the interrupts if CPU0 is overburdened with handling interrupts and other CPUs are relatively idle.
Anyway I assumed you were interested in knowing this, so apologies if that wasn't the case.
George: Audio uses the "Audio" category (or the "Pro Audio" category); video playback uses the "Playback" category.
There are many possible solutions to the problem; some of them have been mentioned on this thread, some of them haven't. The relevant teams have committed to providing a solution to the problem.
There is a large variation on CPU requirements on HD video playback depending on the graphics card you use. If you use a top of the line $500 ATI 2900, it ironically requires more than 60% of a dual-core processor to play back 1080p VC-1 video. If you use a cheap $50 ATI 2400 with full VC-1 bitstream decoding and offloading, you can expect to see 7% CPU utilization on a dual-core processor.
I don't have a problem with the throttling mechanism and I think it's needed. It’s just that it needs to be more intelligent and account for the varying types of media playback and the capability of the CPU and GPU. Now clearly, there’s no reason why MMCSS should engage the packet rate limiter (no matter how much) if Windows Media Player 11 is sitting idle or paused.
Larry, it's pretty sad that an OS 6 years in development, Microsoft's crown jewels is bested by XP. Really really sad and pathetic, in fact.
It seems to me that the main culprit here is that the multimedia stack in Windows Vista is always geared towards low latency, whereas most of the time this isn't needed. For regular multimedia playback, I would actually prefer that the audio stack _not_ try to maintain a 10ms mixing interval, because that burns more CPU time in context switches and disturbs video display timing (which can be critical in windowed display). For regular non-interactive playback, all you need is matched latencies between the audio and video streams. You can mix every 100ms and still have perfectly synced, glitch-free audio.
A 10ms mixing rate also seems like a bad idea for regular desktop usage on laptops, where you want to lower tick rates so the CPU can sleep a bit. Any thought to making this adaptive or tunable in a future version?
Can you believe that Mark's blog entry's comments section is full of suggestions to kill MMCSS?
I wrote a comment there and I would like to suggest it to readers here too, have a look at this page:
try tweaking the parameters mentioned there and tell your results to others.
10000 packets really isn't all that much if your packets are just 64 bytes. That's just 5 mbps. Well below some broadband speeds.
George: This isn't a CPU utilization problem. If the multimedia processes were allowed to run at all, they'd be able to do their work. But in this scenario, without the throttling, under certain network workloads, the multimedia processes don't get scheduled for tens and hundreds of milliseconds. Which causes massive glitching on even non HD content.
The types of playback don't actually matter.
T: Actually Vista was 2.5 years in development - there was another year of prototyping called Longhorn :).
Phaeron: Actually we're getting hammered because we're currently too high latency for certain very common workloads - like voice communications.
Tom M: Yes, but how many real-world workloads use 64 byte packets at 10,000/second? That's how I was able to reproduce the problem on my 100mbit network: I used 600 byte packets.
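The packet-size arithmetic in this thread is easy to check. The sketch below computes how much data per second fits under a 10,000 packets/second limit for a few packet sizes. The function name is made up, and the 9000 byte figure is an assumption (a common jumbo frame size, not a number from this thread); the calculation counts packet bytes only and ignores any framing overhead.

```python
# Throughput that fits under a packets-per-second cap, by packet size.
# Assumption: only the packet bytes themselves are counted.
THROTTLE_PACKETS_PER_SECOND = 10_000


def throughput_at_throttle_mbps(packet_bytes: int,
                                packets_per_second: int = THROTTLE_PACKETS_PER_SECOND) -> float:
    """Megabits per second when every packet allowed by the cap is this size."""
    return packets_per_second * packet_bytes * 8 / 1e6


if __name__ == "__main__":
    # 64 and 600 come from the comments above; 1500 is a full-size Ethernet
    # payload; 9000 is an assumed jumbo frame size.
    for size in (64, 600, 1500, 9000):
        print(f"{size:5d}-byte packets: {throughput_at_throttle_mbps(size):7.2f} mbps")
```

With 600 byte packets the cap works out to 48 mbps, which is within reach of a 100mbit adapter, and it shows why larger frames help: the bigger each packet, the more data fits under the same packet count.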