What are the best practices for building retry logic around network transport failures?

Let's define some terms first so that we share a common vocabulary. By "retry logic" I mean any automatically applied compensation activity that replays the same messages to either the same or a different destination. By "network transport failure" I mean any delivery or communication failure that occurs while attempting to transmit an application message or an infrastructure protocol message. With that out of the way, there are two places where you can handle transport failures.

  • In a layered channel. Use a layered channel when you want the retry logic to be applied to all network calls. A layered channel also allows you to finely control the order of operations by positioning the retry logic channel within the channel stack. Placing the retry logic in a layered channel means that you don't have to deal with it at each application call site.
  • In the application. Use application code to perform the retry logic when a retry decision involves business logic, business rules, or application state. For example, if you encounter an error while sending the third of four related messages, you may need to manipulate application state to reestablish consistency within the system. Placing the retry logic in application code means that you have to make use of it explicitly at each application call site.
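
To make the difference between the two placements concrete, here is a minimal sketch in Python rather than WCF code. Every name in it (TransportError, RetryingChannel, FlakyChannel, send_related_messages) is a hypothetical stand-in invented for illustration, not part of any real channel API. The wrapper class plays the role of a layered channel that retries every send, while the function shows a call site that retries on its own and reports how far it got so that the caller can restore consistency.

```python
class TransportError(Exception):
    """Hypothetical stand-in for a transport-level failure."""


class RetryingChannel:
    """Hypothetical layered channel: wraps an inner channel and retries every send."""

    def __init__(self, inner, max_attempts=3):
        self.inner = inner
        self.max_attempts = max_attempts

    def send(self, message):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.inner.send(message)
            except TransportError:
                # Give up only after the final attempt has failed.
                if attempt == self.max_attempts:
                    raise


class FlakyChannel:
    """Hypothetical test double that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.sent = []

    def send(self, message):
        if self.failures > 0:
            self.failures -= 1
            raise TransportError("simulated delivery failure")
        self.sent.append(message)
        return len(self.sent)


def send_related_messages(channel, messages, max_attempts=3):
    """Application-level retry: the call site owns the retry decision.

    Returns (delivered, succeeded) so that the caller can apply its own
    business rules to whatever was delivered before a failure.
    """
    delivered = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                channel.send(message)
                delivered.append(message)
                break
            except TransportError:
                if attempt == max_attempts:
                    return delivered, False
    return delivered, True
```

With the wrapper, callers simply call send and never see the retries. With the function, each call site has to opt in, but it also learns exactly which of the related messages went through.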

Regardless of the location, the retry logic for a network transport failure is going to look fairly similar.

  • You will find out about a transport failure because some network operation threw an exception. The exceptions that you should consider handling will be subtypes of either TimeoutException or CommunicationException. Your retry logic needs to decide whether the specific exception is recoverable. That decision depends on both the types of network operations that you’re performing and the types of failures that your application is resilient to.
  • Before attempting to retry an operation that uses the same channel, you first need to check that the channel is still usable. If the channel state is anything other than Opened, then you will be unable to send messages using that channel. The only thing that you can do with a Closed or Faulted channel is to throw it away and create a replacement.
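
The two steps above can be sketched roughly as follows. This is a language-neutral illustration written in Python, not WCF code: TransportTimeout and CommunicationError are hypothetical stand-ins for TimeoutException and CommunicationException, ChannelState mirrors only the three states mentioned here, and channel_factory, is_recoverable, and DemoChannel are names assumed for the example.

```python
import enum


class TransportTimeout(Exception):
    """Hypothetical stand-in for TimeoutException."""


class CommunicationError(Exception):
    """Hypothetical stand-in for CommunicationException."""


class ChannelState(enum.Enum):
    OPENED = "Opened"
    CLOSED = "Closed"
    FAULTED = "Faulted"


def send_with_retry(channel_factory, message, is_recoverable, max_attempts=3):
    """Send a message, retrying recoverable transport failures.

    channel_factory: assumed callable that returns a freshly opened channel.
    is_recoverable:  assumed callable that decides whether an exception
                     is worth retrying for this application.
    """
    channel = channel_factory()
    for attempt in range(1, max_attempts + 1):
        try:
            return channel.send(message)
        except (TransportTimeout, CommunicationError) as error:
            # Stop if we are out of attempts or the failure is not recoverable.
            if attempt == max_attempts or not is_recoverable(error):
                raise
            # A channel that is no longer Opened cannot send anything,
            # so throw it away and create a replacement before retrying.
            if channel.state is not ChannelState.OPENED:
                channel = channel_factory()


class DemoChannel:
    """Hypothetical test double that can fault on its first send."""

    def __init__(self, fail_first):
        self.state = ChannelState.OPENED
        self.fail_first = fail_first

    def send(self, message):
        if self.state is not ChannelState.OPENED:
            raise CommunicationError("channel is not usable")
        if self.fail_first:
            self.fail_first = False
            self.state = ChannelState.FAULTED
            raise CommunicationError("simulated transport failure")
        return f"sent:{message}"
```

The recoverability decision is passed in as a function because, as noted above, it depends on the operations being performed and on what the application can tolerate. The sketch retries immediately; whether to wait between attempts is left to the application.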

Next time: A Call to SSPI Failed