Larry Osterman's WebLog

Confessions of an Old Fogey

What's wrong with this code, part 5: The answers


Yesterday’s post was a simple implementation of TransmitFile.  First things first: I need to apologize.  I was being a bit sneaky in this case.  The code in the example was written to function correctly, and that misled a lot of people who were trying to find a bug in the code when there was no bug to find.  There was something wrong with the code, but it was a performance issue, not a correctness issue (although arguably a performance issue is a correctness issue).

The problem was that the implementation didn’t consider the interaction between sending data and the Nagle Algorithm/delayed acks.  These algorithms exist to solve two halves of a basic problem with networking protocols.

Whenever you send data from one machine to another, the sender must wait until the receiver acknowledges the request before sending the next buffer (this is a rough approximation, ignoring sliding windows, etc.).  So when you look at the data flow for a send on the wire, it appears like this:

Client: Send frame 1 to Server
Server: Send acknowledgement for frame 1 to client
Client: Send frame 2 to Server
Server: Send acknowledgement for frame 2 to client
etc.

There is a common misconception that the amount of time it takes to send a message from one machine to another is related to the size of the packet.  On some networks this is true (dialup comes to mind immediately).  But one of the fundamental properties of an Ethernet (or token ring) network is that sending one byte of data takes essentially the same amount of time as sending 1500 bytes of data (the maximum Ethernet payload).  The critical factor is the number of packets transmitted, not the size of each packet.

As a result, if the client sends small buffers of data, the server has to acknowledge each of the buffers before the client can send the next buffer.  The Nagle Algorithm, as described in RFC896, coalesces those small writes from the client into a larger buffer so that the acknowledgement traffic isn’t as daunting.  After 100-200ms of inactivity on the part of the client, the TCP implementation will flush the data out.  As a result, if nagling is enabled and you’re doing small writes to a socket, you’ll see that your writes don’t appear on the wire until 100-200 milliseconds after the write.  That wasn’t what was happening in this case, however.
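
As an aside: if Nagle's coalescing is genuinely hurting a latency-sensitive protocol, it can be turned off per socket with the TCP_NODELAY option.  A minimal Winsock sketch (error handling omitted; the helper name is mine, not from the article):

#include <winsock2.h>

/* Disable the Nagle algorithm on a connected TCP socket so that small writes
   go out on the wire immediately instead of being coalesced while an earlier
   segment is still unacknowledged. */
BOOL DisableNagle(SOCKET s)
{
    BOOL noDelay = TRUE;
    return setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                      (const char *)&noDelay, sizeof(noDelay)) == 0;
}

Turning Nagle off just trades coalescing for more packets on the wire, though; it's only a win when the application is already writing sensibly-sized buffers.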

Now consider what happens with a normal request/response type protocol (like POP3):

Client: Send “USER” to Server
Server: Send Ack for “USER” to Client
Server: Send “+OK” to Client
Client: Send Ack for “+OK” to Server
Client: Send “PASS” to Server
Server: Send Ack for “PASS” to Client
Server: Send “+OK” to Client
Client: Send Ack for “+OK” to Server
Client: Send “UIDL” to Server
Server: Send Ack for “UIDL”
Server: Send “+OK” to client
etc.

You’ll notice that there are twice as many frames being sent on the wire, and that each ack is immediately followed by a response from the server.  Now remember that the acks are just as expensive as the data sends.  It turns out that this behavior is known as the “Silly Window Syndrome”, or SWS.  RFC1122 (the host requirements specification for TCP) specifies a solution to SWS:

A TCP SHOULD implement a delayed ACK, but an ACK should not be excessively delayed; in particular, the delay MUST be less than 0.5 seconds, and in a stream of full-sized segments there SHOULD be an ACK for at least every second segment.

So the receiver can (and should) hold off on the ACK so it can piggyback the acknowledgement on the response.  This behavior is also codified in section 4.2 of RFC2581, TCP Congestion Control.  This “piggyback-ack” behavior isn’t unique to TCP/IP, by the way; the NetBEUI protocol has had it for about as long as TCP has.

Now let’s consider the buggy TransmitFile implementation.  In that case, the client’s issuing multiple sends, but no response is coming back.  Since the sends are synchronous, each send can’t begin until the previous send has completed, and because they’re synchronous sends, TCP can’t buffer the data to coalesce the writes.  It has no choice but to wait for the acknowledgement before the send completes.

Compounding this problem is the size of the transmit buffer: 4096 bytes.  It turns out that the NT SWS avoidance algorithm delays an acknowledgement every other frame (details are here, on page 34).  Once you remove the IP and TCP header overhead from the 1500-byte Ethernet payload, the maximum TCP payload on Ethernet is 1460 bytes per frame.  That means that a 4096 byte buffer takes 3 frames to transmit.  That’s an odd number of frames, so the receiving system defers the acknowledgement for 200 milliseconds in the hope that the receiving application will generate a response for the ACK to piggyback on.

And that’s the crux of the bug.  The buffer size used for writing the file data takes an odd number of frames, which means that each write stalls for 200 milliseconds.  Simply increasing the buffer size to 8192 bytes will speed it up by several orders of magnitude, because an 8192 byte buffer takes an even number of frames (6).  That’s a hack, though; a more correct solution uses overlapped I/O to keep the network adapter saturated, along the lines of the sketch below.
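
As a rough illustration of that approach (this is a minimal sketch, not the code from yesterday's article; the name SendFileOverlapped is made up, and error handling is abbreviated), the idea is to keep one send outstanding on the socket while the next chunk is being read from the file:

#include <winsock2.h>
#include <windows.h>

#define CHUNK_SIZE 8192     /* takes an even number of Ethernet frames, per the discussion above */

/* Hypothetical sketch: stream hFile to a connected, overlapped socket,
   reading the next chunk from disk while the previous send is in flight. */
BOOL SendFileOverlapped(SOCKET sock, HANDLE hFile)
{
    char buffers[2][CHUNK_SIZE];
    WSAOVERLAPPED sendOv = {0};
    WSABUF wsaBuf;
    DWORD bytesRead, bytesSent, completionFlags;
    int cur = 0;
    BOOL sendPending = FALSE;
    BOOL result = FALSE;

    sendOv.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    if (sendOv.hEvent == NULL)
        return FALSE;

    for (;;)
    {
        /* Read the next chunk while the previous send (if any) is still in flight. */
        if (!ReadFile(hFile, buffers[cur], CHUNK_SIZE, &bytesRead, NULL))
            break;

        /* Wait for the previous send to finish before reusing the OVERLAPPED. */
        if (sendPending &&
            !WSAGetOverlappedResult(sock, &sendOv, &bytesSent, TRUE, &completionFlags))
            break;
        sendPending = FALSE;

        if (bytesRead == 0)
        {
            result = TRUE;              /* end of file, everything has been sent */
            break;
        }

        /* Kick off the next send without waiting for it to complete. */
        ResetEvent(sendOv.hEvent);
        wsaBuf.buf = buffers[cur];
        wsaBuf.len = bytesRead;
        if (WSASend(sock, &wsaBuf, 1, NULL, 0, &sendOv, NULL) == SOCKET_ERROR &&
            WSAGetLastError() != WSA_IO_PENDING)
            break;

        sendPending = TRUE;
        cur = 1 - cur;                  /* flip to the other buffer */
    }

    /* NOTE: on the error paths a real implementation would wait for or cancel
       any outstanding send before freeing resources; omitted for brevity. */
    CloseHandle(sendOv.hEvent);
    return result;
}

The real TransmitFile API does this kind of double buffering internally, which is one reason to prefer it over rolling your own.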

I’ve found that whenever ANYONE says that their network I/Os are taking too long, the problem is almost always related to one of these two problems (either nagling or delayed acks).  And this isn’t a theoretical problem either.  Over the last weekend, I decided to write up this series of articles, and when I came in on Monday morning, I was told that there was a performance problem with the Windows Media Connect HTTP server.  My first thought was that it was a problem with the layer reading data under the server; after that was ruled out, my next thought was that the problem was deferred acks.  And after about 15 seconds of looking at an Ethereal trace, I realized that that was indeed the problem.  Fortunately, the fix was relatively simple, but the problem is VERY real.

This was not an isolated case – about every six months or so, someone posts a message on the internal performance analysis alias asking for help with a networking performance problem, and the answer is almost always related to either nagling or delayed acks.

Kudos:  This turned out to be harder than I expected.  Nicholas Allen came up with the right answer, and Simon Cooke expanded on it quite nicely.

And for the unanticipated bugs:

Simon Cooke pointed out that the algorithm issues a 0-byte write if the file size is an exact multiple of the buffer size.

Anon pointed out that there’s no check for null buffers.  This is critical: if the user didn’t specify a header or trailer, the code would crash.

And, because the sockets are opened for overlapped I/O, the synchronous WriteFile call that writes to the socket only works if there are no other threads interacting with the socket.  So the article should have stipulated that the app was single threaded.
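
For what it's worth, the first two of those fixes are just a couple of guards.  Here's a rough sketch (the wrapper and parameter names are made up for illustration; only TransmitTcpBuffer comes from the original example, and its declaration here is assumed):

#include <winsock2.h>
#include <windows.h>

/* Assumed helper from the original example: synchronously writes one buffer
   to the socket, returning FALSE on failure. */
BOOL TransmitTcpBuffer(SOCKET sock, const char *buffer, DWORD length);

/* Illustrative wrapper showing the two guards: optional header/trailer
   buffers are skipped when NULL, and a final 0-byte read does not turn
   into a 0-byte send. */
BOOL MyTransmitFile(SOCKET sock, HANDLE hFile,
                    const char *header, DWORD headerLength,
                    const char *trailer, DWORD trailerLength)
{
    char buffer[8192];
    DWORD bytesRead;

    if (header != NULL && headerLength != 0)
        if (!TransmitTcpBuffer(sock, header, headerLength))
            return FALSE;

    /* The bytesRead != 0 test skips the empty write that would otherwise
       happen when the file size is an exact multiple of the buffer size. */
    while (ReadFile(hFile, buffer, sizeof(buffer), &bytesRead, NULL) && bytesRead != 0)
        if (!TransmitTcpBuffer(sock, buffer, bytesRead))
            return FALSE;

    if (trailer != NULL && trailerLength != 0)
        if (!TransmitTcpBuffer(sock, trailer, trailerLength))
            return FALSE;

    return TRUE;
}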

 

During my research for this article, I ran into a rather nice write-up on SWS and this problem in this article on speeding up web servers.

 

  • FWIW: A couple of bits of socket trivia.

    1. The loop in TransmitTcpBuffer() is unnecessary if called with a stream-oriented socket (like, say, TCP). Calling WriteFile(), send(), or WSASend() on a stream-oriented socket will either a) completely send the entire buffer, or b) fail.

    2. Zero-byte sends may or may not actually cause packets to hit the wire, depending on the underlying protocol. For TCP, it's a noop. For UDP, it sends a zero-byte datagram. For message-oriented protocols (like SPX), it sends a zero-byte message.
  • Thanks Keith, trust the guy who wrote the code to know the REAL answer :)

    Btw, the WriteFile()/send() behavior does NOT apply in general, even with Henry's TCP.

    For Exchange, we had issues with partial writes on Win16 and (I believe) Win9x; the RPC runtime didn't handle partial writes properly, and that caused some "interesting" bugs.
  • Thanks for your detailed answer. I have a question.

    Doesn't Windows have a per socket send/receive buffer? Doesn't send() return after the data is stored in the buffer (instead of waiting for the data to go out on the wire)?

    What about receive, when a TCP packet is received from the adapter doesn't it go to the receive buffer first? How big is the buffer? How can the size be changed?

  • Yes it has a per-socket send/receive buffer.

    But most networking APIs (and in this context writefile is a networking API) guarantee that the receiver has acknowledged receipt of the data before the write completes. Otherwise you could lose data and not have any way of knowing it.

    There ARE variants that don't guarantee that data is received, but that's clearly spelled out in the API.
  • 8/5/2004 3:39 PM Anonymous

    > I have a question.

    Give or take an order of magnitude.

    > Doesn't send() return after the data is
    > stored in the buffer

    Surely that part of it depends on the protocol and the socket options.

    8/5/2004 3:46 PM Larry Osterman

    > But most networking APIs [...] guarantee

    OK, UDP isn't "most", but are the proportions really so unbalanced? This makes me wonder if people who argued for putting acknowledgement in the original Ethernet design might have been right.
  • Norman, you're right. I believe that you can configure sockets to return before the data is delivered, but that's NOT the default behavior.

    And yes, I think that the proportions are that unbalanced. Networking is fundamentally an unreliable medium, but programmers want reliable transfer mechanisms. So the networking guys designed their APIs to be as reliable as possible. If you want unreliable, you need to ask for it.
  • Norman, one example of a "you can shoot yourself in the foot if you want" is the Netbios sendnoack option - that would allow sends to complete before the receiver had completed them, but you had to handle the APIs specially (I don't remember the details now, it was about 15 years ago when I dealt with them).
  • > I believe that you can configure sockets to return before the data is delivered, but that's NOT the default behavior.

    This doesn't make any sense. There would be no point whatsoever to the socket's send buffer if this were the case. A blocking send() would simply lock the application's buffer until done. The docs imply that they don't work this way.

    Even send() explicitly states "The successful completion of a send does not indicate that the data was successfully delivered." (http://msdn.microsoft.com/library/en-us/winsock/winsock/send_2.asp)

    WriteFile() shouldn't be special in this regard either, since you can't even guarantee it writes to a _file_ unless you disable all buffering on the handle, or flush it afterward. Why would the network be special?
  • I followed this up with an extremely ugly and utterly non-robust test program:

    #include <stdio.h>
    #include <winsock2.h>
    #include <windows.h>

    int main(void)
    {
        SOCKET s;
        WSADATA wsa;
        struct sockaddr_in sa = {0};

        /* Target: port 9000 on another machine on the LAN. */
        sa.sin_family = AF_INET;
        sa.sin_port = htons(9000);
        sa.sin_addr.S_un.S_addr = inet_addr("192.168.0.6");

        WSAStartup(MAKEWORD(2,2), &wsa);
        s = socket(AF_INET, SOCK_STREAM, 0);
        connect(s, (struct sockaddr *)&sa, sizeof(sa));
        printf("connect() finished\n");
        getchar();      /* pause here: this is when I unplugged the target's cable */

        send(s, "123", 3, 0);
        printf("send() finished\n");
        getchar();      /* pause again before closing the socket */

        closesocket(s);
        WSACleanup();

        return 0;
    }

    I used Sysinternals' TCPView to monitor things, since I obviously left error checking out of the code. After the connection was made, I unplugged the target machine's network cable and tapped enter. The send() completed immediately.

    What was fun to see was when I tapped enter again to close the socket and exit the program, and only then plugged the target back in. Shortly afterward, it received the data :)

    What puzzles me at the moment is that I couldn't make WriteFile() work as a drop-in replacement for send() there. It always reported ERROR_INVALID_PARAMETER, so I must be missing something.

    On a different note, why does a connected socket appear to hold the same local port open for listening as well? For example, in one connection, local 192.168.0.5:6747 was ESTABLISHED to 192.168.0.6:9000, and local 0.0.0.0:6747 was LISTENING. It's not possible to connect to, but I'm curious why it shows up at all.
  • Sorry Larry, your blog is usually quite excellent, but in this case there is definitely something wrong with what you are saying.

    Assuming that no special socket options have been set, there are no serious perf problems with the code.

    Each socket by default has a kernel buffer of 8192 bytes. If the data you are writing in the WriteFile fits in the buffer, WriteFile returns immediately.

    The behaviour you describe will only happen if you have disabled the socket buffer by setting SO_SNDBUF to 0.

    An article that describes these issues very well can be found at
    http://support.microsoft.com/default.aspx?scid=kb;en-us;214397

    I have made some quick tests to verify what I am saying. I used pcattcp.exe which can be downloaded from
    http://www.pcausa.com/Utilities/ttcpdown1.htm

    Sending data in 4096-byte chunks with default options: 0.36 ms/call
    SO_SNDBUF=0: 7.29 ms/call
    SO_SNDBUF=0, TCP_NODELAY=1: 7.21 ms/call

    H:\ttcp>pcattcp -t -l 4096 srmutv
    PCAUSA Test TCP Utility V2.01.01.07
    TCP Transmit Test
    Remote Host : srmutv
    Transmit : TCP -> 192.71.83.122:5001
    Buffer Size : 4096; Alignment: 16384/0
    TCP_NODELAY : DISABLED (0)
    Connect : Connected to 192.71.83.122:5001
    Send Mode : Send Pattern; Number of Buffers: 2048
    Statistics : TCP -> 192.71.83.122:5001
    8388608 bytes in 0.72 real seconds = 11362.00 KB/sec +++
    numCalls: 2048; msec/call: 0.36; calls/sec: 2840.50

    H:\ttcp>pcattcp -t -l 4096 -b 0 srmutv
    PCAUSA Test TCP Utility V2.01.01.07
    TCP Transmit Test
    Remote Host : srmutv
    Transmit : TCP -> 192.71.83.122:5001
    Buffer Size : 4096; Alignment: 16384/0
    SO_SNDBUF : 0
    TCP_NODELAY : DISABLED (0)
    Connect : Connected to 192.71.83.122:5001
    Send Mode : Send Pattern; Number of Buffers: 2048
    Statistics : TCP -> 192.71.83.122:5001
    8388608 bytes in 14.57 real seconds = 562.21 KB/sec +++
    numCalls: 2048; msec/call: 7.29; calls/sec: 140.55

    H:\ttcp>pcattcp -t -l 4096 -b 0 -D srmutv
    PCAUSA Test TCP Utility V2.01.01.07
    TCP Transmit Test
    Remote Host : srmutv
    Transmit : TCP -> 192.71.83.122:5001
    Buffer Size : 4096; Alignment: 16384/0
    SO_SNDBUF : 0
    TCP_NODELAY : ENABLED (1)
    Connect : Connected to 192.71.83.122:5001
    Send Mode : Send Pattern; Number of Buffers: 2048
    Statistics : TCP -> 192.71.83.122:5001
    8388608 bytes in 14.42 real seconds = 568.06 KB/sec +++
    numCalls: 2048; msec/call: 7.21; calls/sec: 142.02
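
    For reference, the SO_SNDBUF=0 configuration that the -b 0 runs above exercise is just a setsockopt call; a minimal sketch (error checking omitted, helper name is illustrative):

    #include <winsock2.h>

    /* Turn off winsock's per-socket send buffering.  With SO_SNDBUF set to 0,
       a send completes only when TCP is done with the caller's buffer, which
       is what exposes the delayed-ack stall described in the post. */
    int DisableSendBuffering(SOCKET s)
    {
        int zero = 0;
        return setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                          (const char *)&zero, sizeof(zero));
    }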
  • Wait a sec, is the Nagling algorithm turned on by default? If not, how do I turn it on for a socket? (I have an app that sends multiple small packets)
  • I'm calling in the big guns for help (asked the winsock lead for comments). He may comment later. But here are my comments off the top of my head.

    On the question: As far as I know, nagling is ON by default. The easy way to tell is to run Stephan's test with 10 byte writes and look at the results from a remote machine - if the data is transmitted on the wire immediately in 10 byte writes, nagling's disabled. If it's transmitted in larger writes, it's enabled.

    Stephan, did you try the same test with a buffer GREATER than 8192 bytes that takes an odd number of frames? Also, to see this behavior you need TWO machines - one transmits the data, the other measures the throughput. Having netmon (or Ethereal) capture the data would also show the behavior.

    It's possible that http.sys sets SO_SNDBUF to 0, which is why we didn't see this in WMC earlier this week.

    But the behavior is VERY real, and it's burned more people than I can think of.
  • Assuming Larry is right, and assuming default send buffers, isn't it difficult to make sure the sent data is acknowledged? Say one thread sends 100 bytes using WriteFile() and waits for the ack, and another thread sends 200 bytes on the same socket. Now on a stream protocol like TCP, the driver has to mark the boundary after the first send and monitor it, and it is painful to implement that feature. This gets complicated with overlapped I/O.

    Coming from the unix world, I think the only way to guarantee data delivery is to set the send buf to 0 bytes, even that probably only waits for data to leave the wire, doesn't wait for the ack (I am not sure though).

    This is an interesting discussion. Thanks StefanG and Random Reader for your thoughts.
  • From the winsock wizards, when I asked them:
    Yes, Winsock has implemented this default 8k buffer exactly for this reason - to make such naive apps work. After all, most of the apps out there send relatively small packets, and the cost of the extra copy is cheaper than locking the buffer and sending directly from application space. And for those that need to send files, there is a TransmitFile API that implements a proper dual-buffer technique.

    So my app would be broken given a larger buffer size. I'm still trying to get confirmation that http.sys sets SO_SNDBUF to 0, which would complete my understanding of why I messed up so badly.

    As I said: This IS a real problem, and it shows up with great regularity, which is why I wanted to write about it. It was my mistake in not actually verifying this with a real application :(
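
    For reference, the API the winsock folks are referring to is the real TransmitFile from mswsock; a minimal sketch of calling it (illustrative only, not yesterday's code; link with mswsock.lib):

    #include <winsock2.h>
    #include <mswsock.h>

    /* Let the system stream the whole file (plus optional header/trailer)
       on a connected socket, with its own double buffering. */
    BOOL SendWholeFile(SOCKET sock, HANDLE hFile,
                       void *header, DWORD headerLen,
                       void *trailer, DWORD trailerLen)
    {
        TRANSMIT_FILE_BUFFERS tfb = {0};
        tfb.Head = header;          /* may be NULL */
        tfb.HeadLength = headerLen;
        tfb.Tail = trailer;         /* may be NULL */
        tfb.TailLength = trailerLen;

        /* 0 for the byte counts means "send the whole file and let the
           system pick the chunk size". */
        return TransmitFile(sock, hFile, 0, 0, NULL, &tfb, 0);
    }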

  • Anon: I set the buffer to 0 on the high-throughput networking stuff I'm doing. Couple it with overlapped I/O and it allows one to at least get a certain amount of control over network throttling.