BizTalk Core Engine's WebLog

  • Debugging Routing Failures

    Okay, I disappeared for a while. There is this crazy thing called "work" which my bosses (for those of you who are confused, Scott Woodgate is not my boss :) seem to think I need to do sometimes. Some positive news is that BizTalk 2004 SP1 should be released early next year, which is a great thing. When it comes out, I recommend getting it, running it through your test environments, and then plopping it into your systems. There are some good things done here, and I have seen it improve some performance scenarios by up to 10% (although I don't guarantee any of that; just consider it another way for me to make you more likely to try it out :). We are definitely trying.

    In other news, it looks like I might be talking again at TechEd 2005. Plans are to do more performance-related talks (similar to last year's, except probably with more concrete scenarios as examples and with more lessons learned from what customers actually try to do). I am also slated to give a talk on operational health ... how to set up a system for high availability, how to monitor your system to make sure it stays healthy, and what to do when certain indicators go off. If you have any other areas you think would be good talks for me to give, let me know. Always curious. :)

    Okay, now to the topic I mentioned here. I am guessing that by now, if you have been using BTS for long enough, you have figured this out already, but I might as well put some information here anyways so people can find it.

    First, what is a routing failure? My previous post describes how BizTalk sits on top of a pub/sub routing engine which is part of the MessageBox. When the messaging engine or the orchestration engine publishes a message to the MessageBox and no one is subscribing to that message, that is considered a routing failure: an event will be raised, a routing failure report (described later) will be generated, and possibly a message / instance will be suspended. There are a couple of exceptions to this, like NACKs, where the engine knows that routing failures are acceptable, but these are only for internal messages. This "Ignore Routing Failures" type of functionality is not something you can configure, and while I am sure that the hunt is on now, you cannot hack this up either, and it wouldn't be supported. :) Back to the real story. So how do you figure out why it failed to route?

    The routing failure report is literally a dummy message associated with a dummy instance. The only really interesting part of this message is the context: it is the context of the message which failed to route, as it was at the time of the routing failure. It is possible (probable) that the message which gets suspended will not have all of the contextual information which was there when the message failed to route, since we usually suspend the adapter's message, not the message out of the pipeline which we tried to publish. That is why we generate these reports, so that you have a chance to see what really happened. If you open up the context in HAT, you can see the list of properties which were promoted.

    Now, why didn't it route? 99.99% of the time, routing failures occur when you are in testing. They usually occur because you have a misconfigured filter on your sendport or your activation receive. The easiest way to see this is to use the subscription viewer located in the <installdir>\sdk\utilities directory. This tool is a bit rough (sorry, I am not really a UI guy :), but it gives you a view of what your subscriptions are. Ideally you have some idea of where you expected this message to go. Most subscriptions will have readable names, so you can find the one associated with the sendport / orchestration you were expecting it to route to and check the subscription. Simply compare the properties which the subscription is filtering on to the properties which were promoted in the context.

    There are a couple of gotchas which I think are more difficult to see and not well displayed. First, you cannot route a message to two solicit-response ports. We do not support that because we have no idea what it means: you sent out one message but got two responses. Request-response is considered a one-to-one operation. I know there are lots of scenarios for doing more, but to cleanly support those would require exposing a lot more in the engine, like how many people you actually routed the message to, so that you can correctly consume all of the responses. This is not something we are planning on doing anytime soon. So, you should know that a routing failure will be generated if you try to route to multiple solicit-response ports.

    Another boundary case is when you try to route a previously encrypted message. The host to which the message is being routed (be it the orchestration's host or the sendport's host) must "own" the decryption certificate. This is because we do not consider the receive port to be the destination of the message. Its job is simply to convert the message into a desired format and extract all relevant information, including the information required to route the message. The orchestration / sendport is the destination of the message. As such, its host needs to have the certificate to demonstrate that it could have decrypted the message if that hadn't already been done for it. Adding the certificate can be done in the Admin MMC via the properties setting for the host.

    I am not sure if you get a different error in the eventlog for these two boundary cases. All of these cases, though, can be debugged with the routing failure report in HAT, the subscription viewer, the eventlog, and a bit of knowledge of what your system is actually trying to do.
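
    The "compare promoted properties to the subscription" step can be sketched in a few lines of Python. This is only an illustration, not BizTalk code: it calls no BizTalk API, the property values are invented, and the explain_mismatch helper is hypothetical. It assumes you have copied the promoted properties out of the routing failure report in HAT and the filter predicates out of the subscription viewer by hand, and that the filter is a simple AND of equality checks.

```python
# Hypothetical data, typed in by hand from HAT and the subscription viewer.
promoted = {
    "BTS.MessageType": "http://example.org/orders#Order",
    "BTS.ReceivePortName": "OrdersIn",
}

# Each predicate is (property name, expected value); all must match (AND filter).
subscription = [
    ("BTS.MessageType", "http://example.org/orders#Order"),
    ("BTS.ReceivePortName", "OrderIn"),  # deliberately mistyped filter value
]


def explain_mismatch(promoted, predicates):
    """Return human-readable reasons why the filter does not match the context."""
    problems = []
    for name, expected in predicates:
        if name not in promoted:
            problems.append(f"{name} was not promoted")
        elif promoted[name] != expected:
            problems.append(
                f"{name}: promoted {promoted[name]!r}, subscription wants {expected!r}"
            )
    return problems


for reason in explain_mismatch(promoted, subscription):
    print(reason)
```

    An empty result would mean the filter should have matched, which points you at the other causes mentioned here (a service that is not started, the solicit-response restriction, or the certificate case).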

    If you get routing failures in production, it is going to be something in your design. The most common case I know of is where you have a request-response call out of an orchestration, but the response receive shape is in a listen with a timeout. If you hit the timeout and terminate the orchestration, and then the response gets sent back, the messaging engine could get a routing failure. In general, these types of scenarios end up in either zombies or routing failures, since it is simply a race. Not all zombie scenarios cause routing failures, since it is often the case that if the instance is gone, the message might trigger the creation of a new instance (as is the case for a lot of convoy scenarios). You can read more about these subjects in earlier blog entries. In general, though, in this case, it is up to you to decide what you want to do with this response, since the original sending service is gone. I can't really think of other scenarios where you would hit this in production. It is going to be built into your design somewhere. Some race condition exists in your design that can cause this ... almost always because you have a listen with a timeout, or perhaps a listen with a receive on both sides where both can happen (like with control terminate messages).

    Hope that gives some insight. Hopefully I will get a couple more posts in this year. :) Have a happy holiday season.

     

    Lee

  • Another Kevin is posting. :)

    Kevin Lam has decided to start his own blog. He's been on the team for a while now and is one of the product team's frontline guys. He is the guy out there working with the big customers, helping with their designs, getting their stuff into our labs, asking the questions, and making sure we keep the customer focus. I recommend reading his blog.

  • Msgbox Operations Advanced Queries Download available

    Okay, GotDotNet is way too slow getting stuff up. I guess they have to proofread and be careful since it is a public site. So, in the mean time, my friend (not my boss which appears to be the impression some people have. Scott and I got a good laugh) Scott Woodgate has let me put the doc on a private site of his. So this doc is now available at:

    http://home.comcast.net/~sdwoodgate/BizTalkServer2004AdvancedMessageBoxQueries.doc?

    Take a look, read it over, use it when appropriate, send feedback. It has a lot of useful information on how to gather information programmatically from the msgbox. Hopefully all of this type of information will be exposed as perf counters, or through a BTS UI or some APIs, in the next release, but for now, here you go. I believe Paul Somers has also spent some time on this and might be making available a tool which will convert these queries into perf counters which you can administer and monitor. I hope this helps a lot of you out.

     

    Thanks
    Lee

     

  • New download coming

    Okay, so I don't have a link for it, but I have submitted it, so hopefully it will show up shortly. I have written a paper entitled BizTalk 2004 Advanced MessageBox Queries which is designed to help you automate a lot of your operational health management work and also perform advanced troubleshooting of your system. My original version had what Scott Woodgate referred to as a lot of "Lee'isms", so he edited it a bit, but still left some of my humor. :) That version has for the time being been posted to GotDotNet (I don't have a link yet; I will update this with the link, but hopefully it will pop up this weekend so you can look for it). An official version will be posted to MSDN shortly, but my friend Syd is now working to formalize it (which means take out all of the fun and leave it very dry. Apparently my humor does not always translate well :) ). Like I said, I will post the link when it becomes available, but look for it. Eventually I will probably take it off GotDotNet and just point to the MSDN version, as we would do updates to that doc. As the doc points out, we are attempting to automate and make easier as much of this as we can, but I and the rest of the product team realize you bought the current version and need whatever help you can get, now. So this is the first in a series of papers I will hopefully get to (probably most will be much more official and you won't see my name on them). Hope this helps all of you out there. Let me know if there are any other types of information you need to find. A special thanks to Paul Somers, who helped with identifying some of the useful queries. I don't know if he has a blog, but he has a cool tool you can find on GotDotNet which took 1st place in the BizTalk Dev Competition. Congrats to Paul.

    Thanks
    Lee

  • Document Normalization

    Okay, I have seen this copied on two different blogs, so it is apparently useful enough that others might be interested. I posted this originally on a public discussion alias in response to a question on where mappings should be done.

    “While there are actually some performance related reasons to put your maps in the receive and send ports, there are much better business reasons for doing it outside of your schedule. We tend to refer to mapping in receive and send ports as document normalization. In the case of receive ports, you are normalizing the documents from the format of your customers into an internal standard format. On the outbound side, you are converting out of your normalized format and into the specific format of your trading partner or internal application. If you embed the map in the schedule and the partner changes the format, not only do you have to rebuild the map, you have to rebuild the schedule to use the new version of the map. Also, what happens when you add a new partner with a new format? That is a new map, and if you have embedded the map in a schedule, it means a new schedule. This is exactly why we added support for multiple maps (one per source message type) on the receive port, so that you could create a single location for all of your partners and easily normalize their documents into your internal standard formats. Putting these types of maps in schedules would be a bad idea. There are times when it makes sense to use a map in a schedule: when you need to generate a new message in the schedule and use the modified (mapped) contents of an existing message as the base, or when you want to map multiple parts of a message into one outbound message (this type of mapping cannot be done in a receive / send port). There are performance gains which come from doing mappings in receive ports sometimes, but they are mostly around how many persisted messages your scenario generates, and it is a bit complicated to explain. The actual mapping technology is the same. To keep your internal business logic from getting tightly coupled with the document formats of your trading partners, you should do your document normalization (mapping) in the send and receive ports.”

    The key takeaway from this post is that it is important not to tie your business logic to the format of one trading partner. Performance aside (for those of you who attended my perf talk at TechEd, yes, there are some perf benefits to doing this in the ports), the goal of this design is to make your system more robust and able to change as your business grows and adds new partners, and also to allow you to react more easily to changes in your partners' data formats. Brandon Gross comments (in Jeff Lynch's blog) that there are times when the "normalization" is quite complex and it is easier to model this in an orchestration than with our support in the mapper. It is true that there are cases when you simply have no choice but to do the mapping / data conversions in an orchestration, and in those cases, that is what you do. But in general, the best practice I am pushing here is a decoupling of your business logic from your partners' data formats, and so a more robust system.
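
    To make the decoupling concrete, here is a small Python sketch of the idea. It is not BizTalk code: the message type names, the field names, and the functions are all invented for illustration. It assumes one inbound map per source message type (as described for receive ports above) and business logic that only ever sees the internal format.

```python
# Hypothetical per-partner maps into one internal order format.
def from_partner_a(doc):
    return {"order_id": doc["OrderNo"], "total": float(doc["Amount"])}


def from_partner_b(doc):
    return {"order_id": doc["id"], "total": doc["total_cents"] / 100}


# One map per source message type, kept at the "receive port".
INBOUND_MAPS = {
    "PartnerA.Order": from_partner_a,
    "PartnerB.Order": from_partner_b,
}


def receive(message_type, doc):
    """Normalize an inbound document before anything else sees it."""
    return INBOUND_MAPS[message_type](doc)


def business_logic(order):
    """Knows only the internal format, never a partner format."""
    return f"order {order['order_id']} for {order['total']:.2f}"


print(business_logic(receive("PartnerA.Order", {"OrderNo": "A-1", "Amount": "12.50"})))
print(business_logic(receive("PartnerB.Order", {"id": "B-9", "total_cents": 1999})))
```

    In this toy version, adding a partner or absorbing a format change means touching one entry in INBOUND_MAPS; business_logic stays as it is, which is the point of doing normalization at the ports.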

    Hope this helps

    Lee

  • Neglect

    Okay, I have been neglectful and am trying to remedy this. I have a couple of papers I am working on which will get posted here very shortly. Hopefully one will be here by the end of the week. I am also busy at work on BTS stuff, trying to make your life easier (I hope). Finally, I was away visiting some customers and getting some good real-world experience. I have actually visited lots of customers, but it is always good to stay knowledgeable, because it is too easy to get stuck in the glass ball world over here and forget about everyone who really uses this stuff and the problems they face. Never want to have that happen.

    So first off, I just read a write up from Charles Young (who I have now added to my links) on how the subscription routing mechanisms work. Very impressive stuff. He even got some of the sequence of stored procedures being called down. I think it is maybe a bit more technical than a lot of you need, but hey, if you are reading my blog, you must be into this kind of torture, so I would check it out. Good stuff.

    Now let's pick a topic to chat about. I have seen some of the other blogs out there and, well, I am sorry mine is not formatted so nicely. Basically you get me at 11:30 pm feeling like I need to share some insight and just trying to brain dump. So I guess you will bear with me. How about debugging routing failures? Not exactly rocket science to the BTS experts, but probably a useful topic.

    What is a routing failure: Routing failures occur when messages are published into the messagebox, but no service is found to have expressed interest in the message (ie the properties of the message do not match any subscriptions).

    So how do I debug this? -- When a routing failure occurs, the messageagent (see my last post) generates something called a routing failure report. A routing failure report is nothing more than a dummy instance, which holds a reference to a dummy message, which has no parts, but which has the context of the message which failed routing at the time it failed routing. We capture the data like this because some adapters do not suspend the message, and even when the message is suspended, the context of the suspended message is often different from the context at the time of the routing failure. So what can you do with this? Well, really there are only a couple of times routing failures should occur in your system. The first is as you are developing and testing your solution (please test your solution). In these cases you should have a reasonable idea where the message should be going. You can look at the context of the message which failed to route and check which properties were promoted and what their values were. Then, you can use the subscription viewer in the <installdir>\sdk\utilities directory to see what the subscription actually looked like for the sendport or orchestration you thought the message should have gone to. Often it is simply that you forgot to promote a property or just got the value in your subscription a bit off. Or you forgot to start the orchestration or sendport.

    The second case where this can occur is when you try stressing a system with orchestrations which use correlation and you pass in non-unique correlation sets. Don't do this. Try to imagine what is happening with these messages. Now a response which was supposed to go to one orchestration gets broadcast to 20 orchestrations. And then the responses for those find the orchestrations completed and so fail routing. Actually, what would really happen is that about half of the other 19 would get there before the orchestration completed, and you might get 9 zombied orchestrations (see earlier blog) and 10 routing failures. Lesson to be learned: test with real data, and in the real world, correlation sets must be unique.
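
    Here is a toy Python model of why non-unique correlation values cause that broadcast. It is only a sketch: the OrderId property, the instance names, and the route function are invented, and each instance subscription is reduced to a single equality check on one correlation value.

```python
def route(instance_subscriptions, correlation_value):
    """Return every instance whose subscription matches the message's value."""
    return [
        instance
        for instance, wanted in instance_subscriptions
        if wanted == correlation_value
    ]


# Stress test gone wrong: 20 instances all initialized with OrderId "1".
bad = [(f"orch-{n}", "1") for n in range(20)]

# Realistic data: every instance has its own OrderId.
good = [(f"orch-{n}", str(n)) for n in range(20)]

print(len(route(bad, "1")))   # the one response fans out to all 20 instances
print(len(route(good, "1")))  # the response reaches exactly one instance
```

    With the bad data set, one response meant for a single orchestration is delivered to all 20, and everything downstream of that (zombies, routing failures) follows from the fan-out.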

    The third case where you might get this under some type of stress is also tightly related to zombies (see earlier blog). If you think about it, a zombie is a race condition where the message got routed to the orchestration right before it completed, and so you are "completed with discarded messages". Well, what would happen if the race went a bit differently? In that case, instead of routing just before the orchestration completed, the orchestration would complete just before the routing happened. Then you would get routing failures. In these cases, this is what you designed the system to do, so I guess you decide what to do with the failed message at this point. Read my blog on zombies to get a better idea of when zombies can happen.
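
    The two sides of that race can be replayed with a tiny Python function. This is a simplified model and not how the engine is implemented: the event names and the outcome function are invented, and the instance subscription is reduced to a single boolean that goes away when the orchestration completes.

```python
def outcome(events):
    """Replay events in order and report what happens to the late response."""
    subscribed = True  # instance subscription exists while the orchestration lives
    queued = False     # response routed to the instance but not yet consumed
    for event in events:
        if event == "route_response":
            if not subscribed:
                return "routing failure"
            queued = True
        elif event == "consume_response":
            queued = False
        elif event == "complete_orchestration":
            subscribed = False
            if queued:
                return "zombie (completed with discarded messages)"
    return "processed normally"


print(outcome(["route_response", "complete_orchestration"]))
print(outcome(["complete_orchestration", "route_response"]))
print(outcome(["route_response", "consume_response", "complete_orchestration"]))
```

    The same two events in a different order give a zombie in one case and a routing failure in the other, which is why both show up in designs that have this race.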

     

    Sorry this was brief, but it is midnight and I'm tired. Like I said, if things work out correctly, I should have a really good blog coming shortly. Just have to get the signoff from Woodgate and a couple of other people. :)

    Thanks

  • Okay, so where do subscriptions come from?

    So where do subscriptions come from, or more importantly, when do they appear? As we described in our earlier post, there are two types of subscriptions: activation subscriptions and instance (correlation) subscriptions. All activation subscriptions are created by admin tools like BizTalk Explorer or the BizTalk Admin MMC. It does not make sense for the engine to create an activation subscription. (There is one exception to this, and that is our caching service, but that is really not so important; I just like to throw excessive amounts of detail at you.) So these activation subscriptions are created when you enlist your services (sendports and schedules).

    When you enlist a service, its activation subscription is put in the stopped state. This simply means that all messages routed via this subscription are sent immediately to the “suspendedQ”. Why is this state useful? What happens if your backend database is down and so you can't send any messages to it? Technically, you could just let them keep failing and throwing errors and wasting your CPU cycles and filling your event log. Or you can stop your sendport and start it back up when the backend system is back up.

    When you start a service, the subscription goes active. That means all messages which are routed via this subscription are sent to the “workQ”, where they can be dequeued as soon as someone is available to process them. Also, when you start a service, we automatically resume any messages which were suspended because the service was previously stopped. Hence you can do what I described above: you have a service running, the backend system goes down, you stop the service, causing all messages to be routed to the suspended queue, and when the backend system is back up, you restart the service.

    Unenlisting a service causes the subscription to “go away”. In most cases, this means we delete it, but in all cases, this means that no messages will be routed via this subscription. So when you are enlisting, stopping, starting, and unenlisting your services (sendports and schedules), what you are really doing is playing around with the state of their activation subscriptions. It is really as simple as that.
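
    The enlist / start / stop / unenlist lifecycle can be modeled as a small state machine. The Python below is a sketch of the behavior described in this post only: the ToyService class and its method names are invented, the two lists stand in for the “workQ” and “suspendedQ”, and nothing here reflects the real MessageBox schema.

```python
class ToyService:
    """Toy model of a service and the state of its activation subscription."""

    def __init__(self, name):
        self.name = name
        self.state = "unenlisted"  # no subscription exists yet
        self.work_q = []
        self.suspended_q = []

    def enlist(self):
        self.state = "stopped"     # subscription created in the stopped state

    def start(self):
        self.state = "started"
        # Resume everything that was suspended because the service was stopped.
        self.work_q.extend(self.suspended_q)
        self.suspended_q.clear()

    def stop(self):
        self.state = "stopped"

    def unenlist(self):
        self.state = "unenlisted"  # subscription "goes away"

    def route(self, message):
        """Return True if the subscription accepted the message."""
        if self.state == "unenlisted":
            return False
        if self.state == "stopped":
            self.suspended_q.append(message)
        else:
            self.work_q.append(message)
        return True


port = ToyService("SendToBackend")
port.enlist()
port.route("m1")  # stopped: goes to the suspended queue
port.start()      # m1 is resumed onto the work queue
port.route("m2")
port.stop()       # backend is down
port.route("m3")  # parked until the backend returns
port.start()
print(port.work_q, port.suspended_q)
```

    Note that in this model a message routed to an unenlisted service is simply not accepted; if nothing else subscribes to it, that is the routing failure case.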

    Instance subscriptions are always generated by the runtime. Instance subscriptions are created by orchestrations which have a non-activation receive on some correlated property. I can tell you how it works now in the runtime, but you should know that this is no guarantee as to how it will work in the future, and really, it doesn't matter to you as long as it works. Instance subscriptions are created at the next persistence point after all correlation sets involved in the subscription have been initialized. If you have an activation receive which initializes correlation set C1, and a subsequent send (S1) which follows C1, followed by two more receives on C1, the two subscriptions would be created when we send the message out on S1 (assuming no intermediate persistence points like a random atomic scope). To go into more detail on this, I really need to describe what a persistence point is (assuming you did not see my TechEd talk, which talked in detail about these). I guess I will have to save that for next time.

    Note that convoys are, as always, a bit weird, and you should read the post on convoys to understand how their subscriptions are generated.

     

    Have a good one.

    Lee

  • Is there a pub/sub system underneath BizTalk?

     

     

    Okay, according to the stats, you guys stopped reading when I posted the convoy topic, but, well, I'll keep throwing stuff at you and hope you enjoy it.

     

    This is the first installment of the BizTalk pub/sub infrastructure talk. I have met lots of people who are confused about this (some of them are on my team). So the short answer is yes, BizTalk processing is built on top of a SQL-based pub/sub infrastructure which you know as the messagebox. The longer answer involves explaining how you interact with it, since BizTalk does not tout itself as a pub/sub product or really expose a lot of views into its pub/sub nature (except for the subscription viewer, which is a good demonstration of why I am not a UI developer :) ). I will attempt to give you a quick insight into pub/sub in BizTalk. Maybe one day I will take all this stuff I am writing, fancy it up, and make it into chapters in a book or something, but given that I write code, not prose, it’s doubtful.

    There are really two components which together make up the BizTalk pub/sub infrastructure. The database portion is known as the MessageBox. The other portion is what the engines internally use to interact with the MessageBox and is called the MessageAgent. This piece abstracts away all of the guts of the messagebox from the engines (things like multi-messagebox are understood by the agent, but the engines do not need to worry about it). This is probably a bit more than you needed. So in a pub/sub system, well, there are really three things you need to describe: publishers, subscribers, and events (or messages) which flow through the system.

    Publishers: Who are the publishers in your BizTalk system? There are only really 3 publishers in a BizTalk system. Receive ports are publishers. They pick up data from somewhere based on the adapter and the URI, pass it through a pipeline and maybe a map, and then eventually give it to the messageagent and say “Publish this for me, please.” (Our engines are very polite and always say please.) To clear up confusion, data is persisted to the messagebox after the pipeline and after the map, not before. (There is one exception to this involving MSMQt and large messages, but you really don’t need to care about that.) Send shapes in a schedule are publishers. They also give the message to the message agent and say “Please publish this for me. Thank you.” There are also a couple of other random points, like the response portion of a solicit-response sendport. It does a publish. Also, when an orchestration execs (not calls; call is inline and synchronous) another orchestration, it publishes a message too. It is a bit confusing, I know, because you are saying to yourself, “In my orchestration I bind my send action to a specific sendport. What do you mean it publishes?” I will get to that shortly.

    Subscribers: Who are my subscribers? Orchestrations are subscribers. Any receive action in an orchestration maps to a subscription. The orchestration subscriptions are made up of the filter expression on the activate receives and the correlation sets on the subsequent receives (you can see them in your subscription viewer in the sdk\utilities directory). There are two types of subscriptions which I like to talk about … activation subscriptions and instance subscriptions (sometimes called correlation subscriptions). Activation subscriptions start a new instance of a service. These are the subscriptions in your orchestration which you mark as Activate = true. Instance subscriptions, or correlation subscriptions, are subscriptions which route messages to already running instances. They are created after the orchestration instance starts, once the necessary correlation sets have been initialized. It gets tricky with convoy semantics, but I don’t think I can really explain that in a quick blog like this. Let’s just say that in a parallel activation convoy, the subscriptions have got a little activation and a little instance subscription in them.

    Send ports are subscribers. Send port subscriptions are always activation subscriptions. There is one exception to this, and that is ordered delivery sendports. I’ll let you in on a secret. We do ordered delivery in sendports just like you try to build it in your schedules. We use a convoy. So MSMQt sendports are inherently on a convoy, so they are that weird blend.

    Other subscribers are the response portion of a request / response receive port. We use some internal correlation sets to make sure that the response gets back to the correct NT service for things like HTTP, so that we can send the response on the open connection. Another example of a subscriber is when you do delivery notification. We actually create an internal subscription for the notification and use an internal correlation set to get it back to the correct orchestration instance.

    Hmmmmmm. What else. Oh yeah. About the confusion over direct binding to a sendport. All sendports subscribe to messages sent directly to them (you can see this in the subscription viewer) based upon a property called their TransportId, which is an internal BTS property. This way we can force-send messages from the orchestration to the sendport. That’s the basics of it.

    Events: Events or messages are just your messages. The MessageBox and MessageAgent do not care at all about what is in your message. We never look at the contents. To us, it is an opaque blob. We care only about the structure of it … how many parts and what their names are … and the properties associated with the message in its context. There are two basic types of properties on the context: written properties and promoted properties. They are both streamed out to the database when the context is persisted. The only difference is that promoted properties are used to route on. If someone subscribes to “foo = 3” and you promote foo with a value of 3, then your message will go to the subscriber. Anyone can promote anything (almost) at any time, as long as the property is defined in a property schema. If it is not defined in a property schema, you will get an error when you try to promote it. One thing many people don’t know is that the routing layer supports multi-valued properties (i.e. VT_ARRAY | VT_??). Our native components won’t promote anything like this, but you can do it in a custom pipeline component. You cannot reference these properties from within a schedule because orchestrations do not support multi-valued properties, but if you just want to route on them and you have repeating elements, this could work for you.
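
    A minimal Python sketch of routing on promoted properties follows. It is an illustration, not the MessageBox algorithm: the routes_to function and the context layout are invented, filters are reduced to an AND of equality checks, and the rule that a multi-valued property matches when any one of its values equals the subscribed value is an assumption made for the example.

```python
def routes_to(context, predicates):
    """True if every predicate is satisfied by a *promoted* property."""
    promoted = context["promoted"]  # written properties are never consulted
    for name, wanted in predicates.items():
        if name not in promoted:
            return False
        value = promoted[name]
        # Assumption: a multi-valued property matches if any value matches.
        candidates = value if isinstance(value, (list, tuple)) else [value]
        if wanted not in candidates:
            return False
    return True


context = {"promoted": {"foo": 3}, "written": {"bar": 7}}
print(routes_to(context, {"foo": 3}))  # True: foo is promoted with value 3
print(routes_to(context, {"bar": 7}))  # False: bar is written, not promoted

multi = {"promoted": {"foo": [1, 3]}, "written": {}}
print(routes_to(multi, {"foo": 3}))    # True under the any-value assumption
```

    The second call is the classic "forgot to promote the property" routing failure: the value is sitting in the context, but only as a written property, so no subscription can see it.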

     

    There is really only one big gotcha in the routing layer that you have to avoid. Do not create a lot of sendports subscribing to the exact same thing without using a Distribution List (SendPort Group). If you create 100 sendports and all of them subscribe to A=5 & B=4 your performance will be worse than if you were to create one sendport group with that filter expression and then add all of the sendports to it. This is very important. If you do it the bad way, you will see some performance degradation and increased CPU utilization on your master messagebox for routing. Just giving you a heads up. It probably won’t happen till you have a whole lot, but it is just not a good practice to get into. Basically if you have more than say 8 sendports subscribing to the exact same thing, use a sendport group.

     

    Hope this has given you a bit of insight. Apparently there are people out there reading this. Sorry it is not all official and beautiful, but, well, hopefully it is something.

     

    Lee

  • Convoys

    CONVOYS

     

    From the BizTalk Documentation:

    When a group of correlated messages could potentially be received at the same time, a race condition could occur in which a correlation set in a particular orchestration instance must be initialized by one of the messages before the other messages can be correlated to that orchestration instance. To ensure that all of the correlated messages will be received by the same orchestration instance, BizTalk detects the potential for such a race condition and treats these messages as a convoy.

     

    Since I helped proofread the docs (and write this paragraph), I figure I can cut and paste some of them. I spent a while at Tech-Ed talking to my friend Marty about convoys and what they really are, so let’s see if we can get this message broadcast out there. Convoy processing is not a feature of BizTalk which you should be looking at and saying, “Hmmmmm. I wonder how I can use this convoy thing.” A convoy is “something we support”. Convoy is a term which we use to describe a class of application protocols, specifically the set of application protocols which have a race condition as described above.

    Let’s take an example. Say you are a hospital and want to have a service which handles all information about each patient. For a given patient you have three types of messages: an admittance message, status messages, and a discharge message. If you look at your protocol, you will have built a service which just receives. Now let’s think about what could happen. Let’s say you send the patient admittance message and it goes through the system using maybe a synchronous protocol like HTTP. That means when you get the 202 back, you know the message has been delivered. But what if the BizTalkServer host which is actually supposed to process the message hasn’t started yet (i.e. the NT service is stopped)? Maybe you had a power outage, maybe some intern decided to “hit this button”, who knows. So the orchestration instance which is supposed to handle all messages for patient X has not physically started. The message is in the database (MessageBox) and if you look in HAT you will see the orchestration is marked as ready to run, but it can’t start because there is nowhere for it to start. Now let’s say you send a patient update. Let’s think about what BizTalk could be doing (okay, actually is doing) to support general correlation.

    When a message initializes a correlation set, the service instance which the message is delivered to has the responsibility of communicating to the underlying pub/sub system (the MessageBox) that it is now expecting subsequent messages which will follow the same correlation set (based upon what other receive actions you have which follow the same correlation set). In a simple conversation-style protocol (i.e. you say “Hi”, I respond by saying “Howz it goin’”, and then you start talking), I can create all required subscriptions for subsequent messages when I do the next send which follows that correlation set. The protocol indicates that the original sender will not send any more messages until he gets the response back from me acknowledging the start of our conversation. There is no race. I transactionally create the subscriptions at the same time as sending the next message which acknowledges our conversation, so you can’t possibly send a message which I am not ready for. Again, there is no race here.

    Now let’s look back at my original convoy scenario. As you can see, the service instance which will handle all data for that patient does not communicate back to the sender. Heck, it might not even really know who the sender is. So there is this race, since you cannot depend on the service instance to be created and have a chance to turn around and create the subscription before subsequent messages are sent. *We solve this*. When you think of convoys, I want you to not think of BizTalk features, but rather think of your application / business protocols and say, “Hey, that’s a convoy. That’s alright though, BizTalk will detect it and handle it and it will just work.” Convoy is not something you try to use; it is something you just end up using because that is how your business works, and we support it.

    To give you a bit of insight, what happens is that our compiler will detect this race condition and communicate it down to the MessageBox at enlistment time. Normally enlistment just creates the activation subscriptions (subscriptions which start new instances), but in this case it will actually create some sort of templated subscriptions and link them together around this convoy set. Hence, when the first message comes in, the data which coincides with the convoy set is pulled and we essentially complete the templated subscriptions, so that any subsequent messages which match those subscriptions and have the same convoy set properties will go to the newly created instance, even if it hasn’t started yet.
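    To make the templated-subscription idea a bit more concrete, here is a minimal Python sketch of the hospital example. It is only a toy model under my own assumptions: the class name `ToyMessageBox`, the `patient_id` property, and the message type names are made up for illustration and have nothing to do with the real MessageBox schema. The point it tries to show is that activation and filling in the convoy value happen in one step, so a follow-up message gets queued for the instance even though nothing has run it yet.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMessageBox:
    """Toy stand-in for the pub/sub layer (hypothetical, not the real MessageBox)."""
    convoy_property: str        # property that makes up the convoy set, e.g. "patient_id"
    activating_type: str        # message type matched by the activation subscription
    follow_up_types: set[str]   # message types covered by the templated subscriptions
    # convoy value -> messages queued for the instance handling that value
    instances: dict[str, list[dict]] = field(default_factory=dict)

    def publish(self, message: dict) -> str:
        key = message[self.convoy_property]
        msg_type = message["type"]
        if msg_type == self.activating_type and key not in self.instances:
            # Activation: record the new instance and, in the same step,
            # complete the templated subscriptions with this convoy value.
            self.instances[key] = [message]
            return "activated"
        if key in self.instances and msg_type in self.follow_up_types:
            # The completed subscription matches, so the message is queued
            # for the instance even though no host has started running it.
            self.instances[key].append(message)
            return "routed"
        return "routing failure"


box = ToyMessageBox("patient_id", "Admit", {"Status", "Discharge"})
print(box.publish({"type": "Admit", "patient_id": "X"}))
print(box.publish({"type": "Status", "patient_id": "X"}))
print(box.publish({"type": "Status", "patient_id": "Y"}))
```

    In this toy, the status message for patient X lands in the same queue as the admittance message, while a status message for a patient nobody has admitted fails routing.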

    So what is the real difference between a parallel and a sequential convoy? Okay, well there are a number of restrictions imposed by the orchestration runtime which basically require your protocol to make sense, but if you want to know the real guts of it, all it really does is tell the pub/sub infrastructure which of these templated subscriptions can initialize the convoy and which require it to already be filled. So if you think about it, when the first message comes to the activation subscription, it has to fill out the convoy set for its specific data. Then subsequent messages which match other subscriptions will find their data already filled and so go to the right place. Somehow the pub/sub layer needs to know which of the templated subscriptions can actually initialize the convoy set based on the message values and which of the subscriptions simply require the values to have already been filled. So for a parallel convoy, any subscription can activate the convoy. For a sequential convoy, only the first receive action can activate the convoy. If a message comes for the second receive before the first, it will fail routing. Again, this is just a feature of your business protocol. If you know what order the messages are required to come in, then you can have a sequential convoy. If the messages can come in any order, you might have a parallel convoy. It is really that simple.
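    The rule above fits in a few lines of Python. Again, this is a sketch under my own assumptions (the `ToyConvoy` class and the returned strings are invented for illustration), not how the MessageBox is implemented. The only difference between the two kinds is in `can_initialize`: a parallel convoy lets any of its receives fill the convoy set, a sequential convoy only lets the first one do it.

```python
class ToyConvoy:
    """Toy model of which subscription may initialize a convoy set (hypothetical)."""

    def __init__(self, kind: str, receives: list[str]):
        if kind not in ("sequential", "parallel"):
            raise ValueError(f"unknown convoy kind: {kind}")
        self.kind = kind
        self.receives = receives        # message types, in orchestration order
        self.filled: set[str] = set()   # convoy-set values that are already filled

    def can_initialize(self, msg_type: str) -> bool:
        if self.kind == "parallel":
            return msg_type in self.receives    # any receive may activate
        return msg_type == self.receives[0]     # only the first receive may activate

    def publish(self, msg_type: str, convoy_value: str) -> str:
        if msg_type not in self.receives:
            return "routing failure"
        if convoy_value in self.filled:
            return "routed to existing instance"
        if self.can_initialize(msg_type):
            self.filled.add(convoy_value)
            return "activated new instance"
        # A later receive arrived before the convoy set was filled.
        return "routing failure"


sequential = ToyConvoy("sequential", ["Admit", "Status"])
print(sequential.publish("Status", "patient-X"))   # second receive arrives first
print(sequential.publish("Admit", "patient-X"))
print(sequential.publish("Status", "patient-X"))
```

    Swapping `"sequential"` for `"parallel"` in the example makes the first `Status` message activate the instance instead of failing routing.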

    Sorry this isn’t the most well formatted blog, but I felt like I have been slacking (okay, I have been slacking) and wanted to get you some more stuff to fill your head with. In my defense, I decided two and a half weeks ago to do an Olympic distance triathlon and so had to try and pack training into a week and a half (especially after rolling my ankle last Monday). For those of you concerned, I finished and did pretty well for a guy who only trained for a week and a half (www.racecenter.com/hagglake).

    Addendum to my last posting on zombies. I looked into it and discussed it with a coworker of mine and I was wrong about the “virtual zombies”. The orchestration engine does some trickery and ends up causing me to create full-fledged zombies, so if a schedule is going to complete in the “Completed with discarded messages” state, the zombie WMI event will always be thrown. Sorry for the brief bit of confusion around that.

    Possible topics for next time: more convoys, when to use mappings, or my favorite, a discussion on the rising cost of rotisserie chicken in the cafeteria. Perhaps a tournament challenge to guess the price each week, with the winner getting the question of his/her choice answered (I am not a disgruntled employee, I can handle 25 cents, it is just a running joke between myself and some co-workers).

     

    Thanks
    Lee

     

     

    Further reading:  http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sdk/htm/ebiz_prog_orch_iljk.asp

     

  • New good blog out

    Be sure to check out Kevin Smith's (kevinsmi) blog which appears to have just come out. He's an ok chap, although I think he misspelled my last name on purpose (since we shared an office for a year you'd think he'd know how to spell it). Hopefully this weekend I'll put together another good little paper on some topic of interest.
  • Completed with discarded messages??? (zombies)

    Now that I have gotten my post approved by Scott (Woodgate) (check out his blog, too, for more info. He actually seems to know a thing or two about this product. :) Kidding ... he knows a lot, I just like to pull his leg.) here is my first post with real content.

    This is a discussion we have had recently, trying to figure out what we can do about this problem and how prevalent it is, and to give a much clearer understanding of what the problem is. I have seen numerous postings on different internal and external aliases about this topic ... some complaints, some questions, some random musings. Well, here are my random musings. :)

    I admit that sometimes my posts might go over the head of BTS newbies. Hopefully it will get you thinking about some things and when you do hit issues, you will know to come back and reread this and everything will all of a sudden become clear. One thing I want to make very clear is that when I talk about things like what we could do, I am simply musing and looking to get feedback from you as to what you think of these ideas. You are, after all, our customer. However, none of this means that any of this will actually come to fruition. There are priorities and timelines and resource limitations with every project, and management people who make tough decisions on what to do and what not to. So please just consider this a forum and not any type of promise or look into what we are doing in the next release.

     

    First, an orchestration instance gets into the “completed with discarded messages” state when it reaches a terminate point, but messages which were routed to the schedule have not been consumed. To us, this looks like message loss, so we get nervous and suspend your orchestration instance and ask you to take a look at it. If you don't think it is message loss, you can then go ahead and terminate the instance. Internally we call these zombies.

    Why do zombies occur? We will break them down for now into three categories:

    1)       Terminate control messages – the protocol allows for some type of control message to be sent which basically cancels all currently running work in a specific orchestration instance. Since this is a protocol level control message, the user actually wants to just kill everything and tends to handle the zombies by simply terminating them. A number of Human Workflow related designs tend to use this mechanism as well as various other designs.

    2)       Parallel listen receives – in this scenario the protocol waits for 1 of n messages and when it receives certain messages it does some work and terminates. A listen is truly designed to handle the case where only one message is going to come. It is not meant for the case of “well, once one message comes you can do some stuff and finish, because I don't really care about the other message.” Hence, if those messages are correlated together, it is possible to terminate as we are receiving a message for a different branch. The scenario for this is somewhat less clear. I imagine users would still tend towards just terminating the running instance.

    3)       Sequential convoys with non-deterministic endpoints – this is for cases where the user builds a master schedule to handle all messages of a certain type in order to manage some type of protocol requirement. The typical ones are ordered delivery, resource dispenser, and batching, although I am sure there are others. In these cases, the users tend to define a while loop surrounding a listen, with one branch having a receive and the other having a delay shape followed by some construct which sets a variable to indicate that the while loop should stop. This is of course non-deterministic, since the delay could be triggered but a message could still be delivered. Non-deterministic endpoints like this are always prone to zombies.
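    The third category is the easiest one to see with a small simulation. The Python below is a sketch under my own assumptions: `run_batcher`, `idle_timeout`, and `shutdown_latency` are hypothetical names, and the timeline is a plain list of arrival times rather than anything the engine actually tracks. It models the while loop around a listen: the receive branch wins as long as a message shows up before the delay, and anything routed to the instance between the delay firing and the instance completing is left unconsumed.

```python
def run_batcher(arrivals, idle_timeout, shutdown_latency):
    """Simulate a while loop around a listen (receive branch vs. delay branch).

    arrivals: sorted arrival times of messages routed to the instance.
    idle_timeout: the delay branch fires after this long without a message.
    shutdown_latency: time between the delay firing and the instance
        completing; the subscription is still live during this window.
    Returns (consumed, discarded) message counts.
    """
    consumed = 0
    last_activity = 0.0
    i = 0
    while i < len(arrivals) and arrivals[i] < last_activity + idle_timeout:
        last_activity = arrivals[i]   # the receive branch wins the listen
        consumed += 1
        i += 1
    delay_fired_at = last_activity + idle_timeout
    completed_at = delay_fired_at + shutdown_latency
    # Messages arriving after the delay fired but before the instance
    # completed were routed to it and never consumed: that is the zombie.
    discarded = sum(1 for t in arrivals[i:] if t < completed_at)
    return consumed, discarded


print(run_batcher([1, 2, 3], idle_timeout=5, shutdown_latency=1))
print(run_batcher([1, 2, 7.5], idle_timeout=5, shutdown_latency=1))
```

    The second call has a message arriving at 7.5, after the delay fires at 7 but before the instance completes at 8, so it ends up as a discarded message. Messages arriving after completion are not modeled here; in a real system they would presumably activate a new instance.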

     

    What options could we provide for the above scenarios:

    1)       We can keep what we have, which is that when a zombie is detected, we suspend the orchestration and require the user to investigate and act upon the instance appropriately. We currently fire a WMI event so they can script this to a degree, although that WMI event is only fired for zombies triggered by the messagebox. Virtual zombies are cases where a schedule is terminating, looks in memory, and sees that it has messages loaded for processing by a specific receive shape (subscription) which it has not processed. In that case, the orchestration instance will suspend itself and a suspend event will be fired, but not a zombie event. In either case, you would catch the WMI event and signal someone to take a look.

    2)       Allow the user to specify a property on the receive shape which indicates whether or not we should worry about messages delivered to this receive which are unconsumed when a schedule terminates. If the user says not to worry, then at terminate time, we discard those messages and if nothing is left, we cleanly terminate.

    3)       Allow the user to specify a property on the terminate shape which indicates if it is a hard terminate or not. If it is a hard terminate, all unconsumed messages are simply discarded and we terminate cleanly, regardless of any properties at the port level. This makes a lot of sense for the terminate control message case.

    4)       Allow a property to be set at the schedule level which simply says “no zombies” which in essence sets the discard property for all receive shapes.

    5)       Drain – This is the most complicated one and applies mostly to cases which fall into the third bucket from the “why” list. In this case we allow the user to continue looping over a receive shape, but they have somehow disabled that receive and are now draining it. Once there are no more messages available for that receive, it would then follow through and terminate. Much more technically difficult and I have no idea how to build something user friendly to do this but solves a much different problem than the above cases.
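    To compare options 1 through 4 side by side, here is a small Python sketch of what the terminate-time decision could look like. Everything in it is hypothetical: the `Receive` class, the `discard_ok`, `hard`, and `no_zombies` names, and the returned strings are my own illustration of the proposals above, not existing BizTalk properties. Drain (option 5) is deliberately left out because it changes the loop itself rather than the decision at terminate time.

```python
from dataclasses import dataclass


@dataclass
class Receive:
    """A receive shape with some messages still waiting on it (toy model)."""
    name: str
    unconsumed: int = 0
    discard_ok: bool = False   # option 2: per-receive "don't worry" property


def terminate(receives, hard=False, no_zombies=False):
    """Return the final state of an instance under the proposed policies."""
    if hard or no_zombies:
        # Options 3 and 4: discard everything, regardless of per-receive settings.
        return "completed"
    # Options 1 and 2: only receives that are not marked discard_ok count.
    leftover = sum(r.unconsumed for r in receives if not r.discard_ok)
    if leftover:
        return "suspended (completed with discarded messages)"
    return "completed"


shapes = [Receive("Update", unconsumed=2), Receive("Cancel", unconsumed=0)]
print(terminate(shapes))              # option 1: today's behavior
print(terminate(shapes, hard=True))   # option 3: hard terminate
```

    Marking the `Update` receive with `discard_ok=True` would give the option 2 behavior: the two leftover messages are dropped and the instance completes cleanly.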

     

    How might we expose these options:

    1)       No change … this is what we do right now

    2)       A new property on the receive shape which gets translated into an attribute of the subscription by the compiler.

    3)       A new “property” on terminate. So now you have terminate “Control message triggered termination” hard

    4)       Potentially a setting at the orchestration level. Not sure.

     

    The other thing to consider for all of these settings is that not only do we want to consider exposing these at a designer level for the developer, but the administrator might also like access to this type of information and control. A developer might not fully understand the protocol or understand the IT policy for handling these cases and so would be very hesitant to set a hard terminate or discard messages property. However, the IT admin would notice that a specific orchestration type is causing a lot of zombies and would like to be able to adjust the “zombie policy” without recompiling the orchestration. Not really sure about this.

    5)       Okay, I am not completely sure how we would express this. This is the most non-trivial one.

     

    Did we solve anything:

    So for the first two cases, it is a bit easier to understand what the goal of the customer is. The new flags we would be exposing would simply be about reducing the amount of management infrastructure they would need to build around handling their orchestrations. There is a workaround right now to catch the WMI events and act upon them programmatically, but that might not be enough depending on how much detail is provided in the suspend event fired for non-messagebox triggered (virtual) zombies (as described above). The other thing to consider is that if you have a set of linked orchestrations communicating with each other and one is terminated by some type of control message, the other schedules might continue to try to communicate with it and so get routing failures. Obviously this is what the customer wants to a degree; I am just not sure what we would be expected to do in this case.

    So in the third case, it is all about why they have a non-deterministically terminating schedule. The most common cases are the ones I listed above (ordered delivery, resource dispenser, batching). If we look at these, we have to ask whether we have helped the users by adding drain. For ordered delivery, probably not. There would be a new race condition introduced which would allow multiple orchestrations to execute simultaneously on the same correlation set messages (one is draining while the other just got kicked off from a new message). This would break the point of their ordered delivery implementation. The better solution for us is to help solve the actual ordered delivery problem and not the zombie problem. For the resource dispenser paradigm, it depends on how strict the limitations on those resources are. If they are rigid, then it has the same problem as ordered delivery. If they are slightly more flexible and can handle an occasional blip of one or two extra running schedules, then this would probably help them. For batching, this could help a lot. In fact, this is pretty much what batching wants … to gather up a certain amount of data in one schedule, put it together into one message using some type of logic, and send it on. Of course this would only be useful for not particularly strict, time-based batching systems. For systems which require batching of exactly 10 messages, we would need an additional mechanism to make a more deterministic cutoff point for the start of drain. I am not sure how often that is the case. I think the basic key is to understand all of the cases where customers are using non-deterministic terminating protocols and see if we can help in those cases. If we try to attack only the zombie problem for this case, I am not sure we will be able to solve anyone's problem fully.

    Hopefully this thread has given you a bit more insight into why you get orchestration instances in this “completed with discarded messages” (zombie) state and why we are not always sure what to do. If you have feedback as to why you are getting hit by this, please send it on to me. Since I am a developer, :), this is my outlet to talk to customers and the field.

    Hope this has helped some and also feel free to send requests for topics of discussion. Again, I have a job which requires a lot of time, but getting the word out is also part of my job so I will try to make time. Just don't get mad at me if it takes a while or if your topic doesn't make it cause it is way outside of my field. If there is enough demand, I will yell at some of my co-workers and get some more blogs going. Have a good day.

    Thanks
    Lee

    PS: Upcoming post ideas seem to be: “Mapping: Receive / Send Ports or Orchestrations? Why and when?”, “What is this convoy thing people keep talking about”, “Is there a pub/sub system in BizTalk?”, “How do I debug routing failures?”, “What is an orchestration persistence point and why should I care?”, “How does this delivery notification thing actually work?”, “What are service links and roles?”. These are just a couple of ideas. Feel free to send me your ideas. (I have a bad feeling that I might get blasted with that statement, but it will be good to know what people want more info on). Thanks.

  • I'm back. :)

    Okay, wow apparently people are out there waiting for some more BTS blogs. So here we go.

    First, who am I? My name is Lee Graber. In August I will have been working as a developer on BizTalk Server for 5 years. In previous releases I owned (a non-MS friend recently commented on the oddity of using the word own to describe my work, but it's what I say) a lot of the core messaging engine including the interchangeSQ db, IInterchange object, the work scheduler for both processing and receiving, overall performance, and various other things. In this release I own the messagebox and everything that entails (subscriptions, convoy, a lot of perf), bits and pieces of the messaging engine, our caching infrastructure, and I am, as of a month ago, the proud new owner of tracking from the runtime side. Yeah. :) So these are areas I have a lot of knowledge in, but I am probably good for most of the basics of the core platform. If you went to Tech-Ed 2004, you might have gone to my performance talk. If you didn't, you should get the DVD. :) Oh yeah, I have an obsession with smiley faces. I think for my first real post, I am going to clean up a recent email I sent internally about zombies (schedule instances marked “completed with discarded messages”) and we will see what kind of feedback you guys can give me. Hopefully I will be recruiting a couple of co-workers like Jean and Paul to post and maybe we will get 2 or 3 insightful posts a week. I guess we will see how this pans out.

    Thanks for using (or considering using) our software and I hope this helps

    Lee

  • Testing

    Okay, this is just a test to see if this is actually working. I am trying to create a blog where some of the core engine guys from the BTS team can post thoughts on design paradigms and explain feature decisions. I haven't figured out exactly how this feedback thing works, but I guess we get feedback. Since the only guy I saw “blogging” was Scott Woodgate, I figured perhaps a bit more info out there would be good. :)  We should be able to get postings on here from the guys who wrote the orchestration engine, messaging engine, messagebox, some of the adapters, ... . We'll see how it goes. Okay, let's see if this works.

    Lee


© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement