This is a question that has come up in a few customer engagements, and on several forum posts – how do I find the top category in a stream for a given group? Several examples of this pattern of question are:
Writing a StreamInsight query to answer this question typically follows one of two patterns:
Note: all of the source code, sample data, queries etc for this blog post may be found here. Please download and follow along.
The 2-step process for finding the Top X is detailed below. For the purposes of this example, we’ll be using the question How do I determine the top 2 most popular categories in the past 8 hours with some sample data and results.
Ok, sounds reasonable – what does this look like in terms of a StreamInsight query?
This query creates the initial aggregates, being the window of time (start time / end time) and a set of events containing the SKU category and count.
Then we’ll take a snapshot of the initial aggregates window, sort by the count and take the top 2 results.
Resulting in the answer to our question:
Putting it all together:
The 3-step process for finding the Top X by Grouping is detailed below. For the purposes of this example, we’ll be using the question How do I determine the most popular category for each user in the past 8 hours with some sample data and results.
This query creates the initial aggregates, being the window of time (start time / end time) and a set of events containing the user Id, SKU category and count.
Next we’ll re-group this window of events before performing the order by / take X:
This logically splits the window of events up into groups based on the user ID.
Finally, we use the ApplyWithUnion operator to perform an order by / take operation on each individual group, then union the results back together.
This syntax can be a little overwhelming at first glance, so let’s break it down and examine what’s going on.
StreamInsight provides rich capabilities for ranking (TopK) events within a window (or groups within a window) by using the OrderBy and TopK (Take) methods. These can be used, in conjunction with the appropriate windows and aggregates to find trends, “hot” patterns and other interesting occurrences within event streams.