In a prior blog post (http://blogs.msdn.com/b/appfabriccat/archive/2010/11/03/streaminsight-query-pattern-find-the-top-category-using-order-by-take-and-applywithunion.aspx) I discussed how to determine the top category in a set of events using order by and take. However, I only covered non-overlapping aggregate windows (i.e., Tumbling windows); I did not cover how to adjust the query definition for overlapping (Hopping) windows. Since this question came up in the forums, and it’s somewhat challenging to post a picture there, here’s how to fix the problem.
Let’s start with our “working” top category query.
The diagram below describes what is actually happening in terms of event flow. The input stream, weather, is broken up into substreams by station code, and a Tumbling window is applied to each substream. Each window calculates the number of events in each category, and those counts are then passed to the snapshot window (and then to the order by and take).
This works so long as the input events to the snapshot window do not overlap. If they do, older “calculations” can be counted against the new ranking. For example, if a Hopping window is used, such as this query:
The following effect is observed: at some points in the event stream, the results of older Hopping windows are still passed to the snapshot window for ordering, alongside the results of newer ones. This can produce incorrect results, as seen in this diagram.
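To make the overlap concrete, here is a small Python simulation of the event flow. It is not StreamInsight code. It assumes each window emits one interval event per category whose lifetime equals the window, and that a snapshot ranks every interval event alive in it. The helper names (`window_counts`, `snapshot_top`) and the sample data are hypothetical, and hopping windows that would start before time 0 are ignored for brevity.

```python
from collections import Counter

def window_counts(events, size, hop, horizon):
    """Count events per category in every window [start, start + size).

    Returns interval events (start, end, category, count) whose lifetime
    is the window itself (an assumption of this sketch).
    """
    out = []
    for start in range(0, horizon, hop):
        end = start + size
        counts = Counter(cat for t, cat in events if start <= t < end)
        out.extend((start, end, cat, n) for cat, n in sorted(counts.items()))
    return out

def snapshot_top(intervals):
    """For each snapshot (the span between adjacent endpoints), keep the
    highest-count interval event alive in it: order by + take(1)."""
    edges = sorted({p for s, e, _, _ in intervals for p in (s, e)})
    top = {}
    for lo, hi in zip(edges, edges[1:]):
        alive = [iv for iv in intervals if iv[0] < hi and iv[1] > lo]
        if alive:
            top[(lo, hi)] = max(alive, key=lambda iv: iv[3])
    return top

# Hypothetical sample data: (timestamp, category) point events.
events = [(0, "A")] * 4 + [(10, "B")] + [(20, "B")] * 2 + [(30, "C")]

# Tumbling: size == hop, so window results never overlap.
tumbling = snapshot_top(window_counts(events, size=20, hop=20, horizon=40))
# Hopping: size > hop, so consecutive window results overlap.
hopping = snapshot_top(window_counts(events, size=20, hop=10, horizon=40))

for name, result in (("tumbling", tumbling), ("hopping", hopping)):
    for snapshot, winner in result.items():
        print(name, snapshot, winner)
```

With this data, the tumbling run yields one winner per window. In the hopping run, the snapshot (10, 20) still ranks A from the window that started at 0 first, although the window that started at 10 contains only B events.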
To prevent this from happening, we need to stop the output of the Hopping windows from overlapping in the snapshot window. There are a number of ways to do this, but the easiest is to convert the output of the Hopping window back into point events via the ToPointEventStream method, so that each window’s results occupy a single instant and cannot overlap with the results of the next window.
This produces the desired results:
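As a rough illustration of why the conversion helps, the Python simulation below models the fix. It is a sketch, not StreamInsight code: it assumes a point event can be modeled as an interval one tick long starting at the window start, and the helper names (`window_counts`, `to_point_events`, `snapshot_top`) and sample data are hypothetical.

```python
from collections import Counter

TICK = 1  # assumed lifetime of a point event, in the same integer time units

def window_counts(events, size, hop, horizon):
    """Per-category counts for each window, as (start, end, category, count)."""
    out = []
    for start in range(0, horizon, hop):
        end = start + size
        counts = Counter(cat for t, cat in events if start <= t < end)
        out.extend((start, end, cat, n) for cat, n in sorted(counts.items()))
    return out

def to_point_events(intervals):
    """Shrink each interval event to a one-tick event at its start time."""
    return [(s, s + TICK, cat, n) for s, _, cat, n in intervals]

def snapshot_top(intervals):
    """Keep the highest-count event alive in each snapshot (order by + take(1))."""
    edges = sorted({p for s, e, _, _ in intervals for p in (s, e)})
    top = {}
    for lo, hi in zip(edges, edges[1:]):
        alive = [iv for iv in intervals if iv[0] < hi and iv[1] > lo]
        if alive:
            top[(lo, hi)] = max(alive, key=lambda iv: iv[3])
    return top

# Hypothetical sample data: (timestamp, category) point events.
events = [(0, "A")] * 4 + [(10, "B")] + [(20, "B")] * 2 + [(30, "C")]

hopping_output = window_counts(events, size=20, hop=10, horizon=40)
fixed = snapshot_top(to_point_events(hopping_output))

for snapshot, winner in fixed.items():
    print(snapshot, winner)
```

Each snapshot now spans a single tick and sees only the counts from one Hopping window, so the ranking for a window is never influenced by an older one.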
Note that the diagrams don’t entirely line up; I’ll polish this up a bit and add some information on tie-breaking before posting to the AppFabric CAT blog site next week.