Developing Distributed Systems
Over the last five years I've had the fortune to do research and development on several different distributed systems for Microsoft. I've learned a lot from these efforts, some of them things I embarrassingly knew a decade ago and promptly forgot. Here's the three most important:
- Know what you're building. As a corollary, make sure you can tell when you're done.
- Establish 'correctness' criteria to let know you when things work.
- Have a way to figure out what's wrong when your system doesn't work.
The first two aren't too hard as long as you think of them, but the third can be a real bear, especially for distributed system developers.
Take developing PNRP (the Peer Name Resolution Protocol) for example. This is a complicated P2P service which is a special-case of a distributed hash table (DHT). To the untrained eye, PNRP appears to route messages randomly through the set of all people running the service, making it difficult to know if it's working correctly except by final transaction results. By mid 2002 the PeerNet team had PNRP implemented and up and running in the lab. We did a simple test out in the wild with a couple dozen participants. The moment of truth came, and we started our services. Each of us registered our email name, and then we tried to resolve each other.
To our horror, it didn't work. Even worse, we didn't have any obvious way to diagnose the problems.
In the absence of custom tools, everyone ran NetMon (Microsoft's network sniffer) and saved 5-minute snippets of trace as we would all bring the system up and try something. The traces were saved off, collected together, and then handed to a very talented network developer to examine. To his credit, after a few days of work he had at least some answers for the first five minute session. You'll never grok just how big an achievement this was, unless you happen to be in the handful of people who have developed P2P protocols.
The next time I helped develop a P2P protocol from scratch was as part of a research project at MSRC. The researchers working on it were already familiar with the woes of debugging distributed systems, so from the very beginning we ensured we could debug any issues that arose. There are two good ways of doing this.
- Simulation. This will help you catch a healthy chunk of systemic problems, but ultimately it's only as clever at finding problems as you are at manufacturing them. There's no way to get away from the need for #2,
- Centrally coordinated logging. Servers are anathema to P2P, but the truth is you should never shy away from a great solution on a purely idealistic basis.
Centralized logging is easily the most important tool we had for validation and troubleshooting. It let us determine which nodes had connectivity, whether they were working they way they should, throughput and consistency numbers, node connectivity, and so on.
If you're developing a distributed system, I highly recommend implementing a logger with the following properties
- Have a centralized log-server which receives all log data. Use a 'cloud ID' and 'node ID' for each party logging data to separate their activities.
- For each message you generate and transmit, log both local generation time and central server receive time. This will let you establish a global ordering of logged messages, critical for understanding the system's behavior.
- Have a (removable) logger for the P2P client which logs activities to the central server.
- Logger transport should be reliable, to prevent message loss. In other words, even when everything else breaks, your logger should still be able to get all of its messages to the central log server. This can be a serious amount of work, but is well worth it.
- Logs should be machine-parsable, to enable automated analysis and processing
- Logs should be human-readable, or translatable into human-readable form. Once you work with the logs for a while you'll get an intuition for reading them, and find problems that would have been impractical to find with automated processing.
Figuring out the right granularity of messages to log can be tricky. Here's what was useful for us:
Log messages for each of the following:
- Client startup, client shut down
- Connection data: connection initiated, succeeded / failed, remotely initiated connection received, connection torn down. You'll be amazed how many problems boil down to the right peers not communicating to each other.
- Periodic snapshots of activity. For example, every 60 seconds log bytes sent / received, connections received / rejected / initiated in that 60-second interval.
- Significant events, or summaries of significant events. If you're doing file sharing, a significant event might be transferring a file block to a remote party. If you're doing name resolution, it could be receiving and retransmitting a name resolution request.
There's much more you can log, but even these essentials will go a long ways towards helping you understand your system's behavior and validating it.
Have you developed a distributed system? If so, did you use logging? Or, if you're just getting ready to start developing, are you planning on logging? I'll be very interested to hear...