I finished the code for this project months and months ago, and I had every intention of writing a full MSDN article describing the ins and outs of what I’d accomplished, but time seems to have gotten away from me. Rather than let the code languish any longer, I’ve decided to simply write up a (relatively) short blog post, making the code available for all to explore.
For those of you who have read my previous MSDN articles about Media Center, you know I’m an avid fan. I’m also really interested in exploring and expanding on the developer scenario for working with Media Center as well as with the DVR-MS files in which it saves recorded video. There is so much interesting information there, be it the actual video or audio content, or be it the metadata surrounding that content. However, there is a large amount of information available in DVR-MS files that has largely gone unnoticed and unsung, and with this post I hope to rectify that inequity.
Closed captions are an amazing source of information about recorded television shows. The metadata headers in DVR-MS files provide a great deal of information about the show, most of it based on information captured from the electronic program guide. But closed captions detail everything that goes on in a show: the lines that were spoken, interesting dialog, when music is played, and so on. Being able to harness this data makes it possible to write a wide variety of just plain cool applications that make watching and working with recorded television much more enjoyable.
To exemplify this, I’ve written a managed class library in C# (extending that which I created for my original Fun with DVR-MS article on MSDN), that parses and exposes all of the closed captioning data from NTSC non-HD DVR-MS files in a developer-friendly fashion. I’ve then layered on top of this library a few sample applications to demonstrate the power of such a library and the types of application you can have fun writing yourself.
How do I use the library?
You can download the complete sample source code and compiled binaries here. It includes Toub.MediaCenter.Dvrms.dll, which provides the classes for parsing out the captions and working with them. The central class in this effort is the abstract class ClosedCaptionsParser, available in the Toub.MediaCenter.Dvrms.ClosedCaptions namespace. From this class derives the concrete NtscClosedCaptionsParser class, which is used to extract the closed captioning data from NTSC files (you’ll notice that the library also includes a PalClosedCaptionsParser, but if you look inside it, you’ll see that it’s simply a shell waiting for you to implement it ;)
To use an NtscClosedCaptionsParser, one simply instantiates the class, passing to the constructor the path to the DVR-MS file whose captions are to be extracted, and then calls the parser’s GetCaptions method. GetCaptions returns a ClosedCaptionCollection, which contains the parsed ClosedCaption instances. Each ClosedCaption instance exposes four properties, three of which are TimeSpans and one of which is a string. The string is the text of the caption (most likely an individual word or sentence from the captioning), and the TimeSpans represent the times at which the data for the caption began to be received, was displayed to the screen, and was cleared from the screen.
The NtscClosedCaptionsParser actually creates instances of a class derived from ClosedCaption, NtscClosedCaption, that provides two additional properties. The first of these additional properties is of the enumeration type NtscClosedCaptionType and describes the type of the caption: roll-up, pop-on, or paint-on. To understand the distinction between these types of captions, I suggest you read the free Code of Federal Regulations document 47CFR15.119. This will provide you with a PDF of the subsection titled “Closed caption decoder requirements for analog television receivers,” which is part of the book covering the Federal Communications Commission (FCC) and the section covering radio frequency devices. The other property exposed by NtscClosedCaption is Channel, which tells you with which channel of captioning this particular caption is associated (you’ll often find multiple languages on different channels interwoven, and this allows you to separate them out from each other).
So, as an example of how this can be used, here is a snippet of code that parses the captions from a DVR-MS file and writes them out to the console in a tab-delimited fashion:
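NtscClosedCaptionsParser parser = new NtscClosedCaptionsParser(filename);
ClosedCaptionCollection ccs = parser.GetCaptions();
foreach(NtscClosedCaption cc in ccs)
{
    // One caption per line: the three timecodes, then the text, caption type, and channel.
    Console.WriteLine("{0}\t{1}\t{2}\t{3}\t{4}\t{5}",
        cc.StartTimecode, cc.DisplayTimecode, cc.ClearTimecode,
        cc.Text, cc.CaptionType, cc.Channel);
}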
A Few Words About NTSC Closed Captions
A brief tour of NTSC closed captions is probably appropriate. Note that there are a lot of intricacies involved with parsing and rendering closed captioning data, and I’ve chosen to ignore most of them. Thus, I’ve coded up a simplified parser that just deals with the text included in the stream (most of the specification is dedicated to the commands that are also included in the data, which dictate actions such as moving the cursor around the screen and changing the font color). This works very well for most, if not all, of the recorded programs I’ve tested it with, but there could very well be some recordings that cause this code to blow up. If it destroys your computer, I take no responsibility (though I would like to know about it, as I’d find it pretty amazing).
DVR-MS files are created by the Stream Buffer Engine (SBE) introduced in Windows XP Service Pack 1, and are used by Media Center for storing recorded television. They contain metadata describing the contents of the recording (title, episode title, actors, etc.) as well as a variety of data streams. A typical DVR-MS file is made up of two or three of these streams, depending on what version of Media Center created it, the type of signal being recorded, and the actual data being recorded. Most, if not all, DVR-MS files will have both an audio and a video stream. The third stream is the one that may or may not exist, though it will for almost all NTSC content recorded with a recent version of Media Center. This third stream contains the closed captioning data for the recorded television show. As with the audio and video streams, the closed captioning stream is encrypted/tagged, so it must first pass through a Decrypter/Detagger filter. If you use GraphEdit to look at the Out pin on this filter, you'll see that its major type is AUXLine21Data and that its subtype is Line21_BytePair (this may vary based on whether the content is NTSC, PAL, HD, etc.). Closed captioning in NTSC television shows is encoded into line 21 of the Vertical Blanking Interval (VBI).
There are a few approaches one could take to extract the closed captioning data from the CC stream. One approach would be to hook up a Dump filter to the CC stream, saving the CC stream’s raw contents to a file. This file could then be opened and parsed through a managed FileStream. This approach, while very simple, has some significant downsides. The streams that make up DVR-MS files are actually split up into a series of samples, where each sample contains data but also some metadata about that data. One important piece of metadata for each sample is the timecode at which that sample is supposed to be rendered when the file is played. However, when you use the Dump filter to dump the data contained within a stream, you’re simply dumping the data contents of each sample, not the metadata. Thus, there’s no way by examining the dumped bytes to determine with certainty what timecode each dumped byte corresponds to in the original file, especially if the CC stream isn’t continuous (for example, if commercials in the show aren’t captioned).
There’s another way to access the closed caption data, and this approach provides full fidelity, as it provides you access to the metadata for each sample. The Windows Media 9 Series Format SDK provides classes and interfaces that allow you to read Windows Media files using synchronous calls. These interfaces can also be used with DVR-MS files, and as such the Format SDK allows you to process DVR-MS files using the IWMSyncReader interface. The WMCreateSyncReader function is used to create one of these synchronous reader objects.
In order to parse out the closed captions, I use an IWMSyncReader to find and walk through the closed captions stream in the DVR-MS file. This can involve reading in and looking through gigabytes of data, which means that this isn’t currently a fast process. In fact, parsing the captions from a half-hour show can take a minute or more on my laptop (though I’m running on battery power right now, on the plane on the way back from PDC 2005, so that might be slowing down the process further).
The closed caption stream is divided into a series of two-byte instructions, some of which are data and some of which are command codes that detail how to process the data. For example, one command might inform the processor to render the current caption to the screen, and another command might ask the processor to clear the currently displayed caption. My parser is a very simple state machine that loops through the data looking at each byte pair, processing and reacting to each as it occurs. There is a whole list of command codes which I’ve special-cased in a switch statement: if any of those are found, the command itself is ignored and instead a space is added to the output text. If the command is 0x942f (end of caption) or 0x942c (erase displayed memory), it is treated as the end of the current caption, and whatever text has been seen up to this point is stored into a new ClosedCaption instance. This instance is then added to the collection of captions, and the text buffer is erased to prepare for the next caption. With the exception of special characters (which occupy two bytes and begin with either 0x11 or 0x19, both of which I’ve ignored and treated as commands) such as musical notes and registered marks, each text byte is inclusively between 0x20 and 0x7A and represents its ASCII equivalent. As such, if a byte isn’t part of a command, I simply cast it to Char and add it to the current text buffer (which will eventually be stored in a ClosedCaption instance when an end of caption or erase displayed memory command is received).
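To make the state machine concrete, here is a minimal sketch of that loop. It is not the library's actual code: Line21Sketch, ExtractCaptionText, and IsCommand are hypothetical names, the command test (first byte in 0x10-0x1F once the high bit is masked off) is an assumption standing in for the explicit switch statement described above, and timecodes are left out entirely. Trimming the buffer and skipping empty captions are also choices made for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the parser's inner loop; not the library's actual code.
public static class Line21Sketch
{
    const ushort Filler = 0x8080;               // frame carried no useful caption data
    const ushort EndOfCaption = 0x942f;
    const ushort EraseDisplayedMemory = 0x942c;

    // Assumption for this sketch: a pair is a command when its first byte,
    // with the high bit masked off, falls in 0x10-0x1F. That range also
    // covers the special-character prefixes 0x11 and 0x19.
    static bool IsCommand(ushort pair)
    {
        int first = (pair >> 8) & 0x7F;
        return first >= 0x10 && first <= 0x1F;
    }

    // Walks the byte pairs in order and returns the text of each completed caption.
    public static List<string> ExtractCaptionText(IEnumerable<ushort> bytePairs)
    {
        List<string> captions = new List<string>();
        StringBuilder text = new StringBuilder();

        foreach (ushort pair in bytePairs)
        {
            if (pair == Filler) continue;

            if (pair == EndOfCaption || pair == EraseDisplayedMemory)
            {
                // End of the current caption: store the buffered text and reset.
                string caption = text.ToString().Trim();
                if (caption.Length > 0) captions.Add(caption);
                text.Length = 0;
            }
            else if (IsCommand(pair))
            {
                // Any other command is ignored and replaced with a space.
                text.Append(' ');
            }
            else
            {
                AppendIfText(text, (byte)(pair >> 8));
                AppendIfText(text, (byte)(pair & 0xFF));
            }
        }
        return captions;
    }

    // Text bytes are those inclusively between 0x20 and 0x7A.
    static void AppendIfText(StringBuilder text, byte b)
    {
        if (b >= 0x20 && b <= 0x7A) text.Append((char)b);
    }
}
```

Feeding it the pairs 0x9420, 0x4849, 0x8080, 0x942f, 0x4f4b, 0x942c yields two captions, "HI" and "OK": the first command becomes a space that is later trimmed, the filler is skipped, and each of the two terminating commands flushes the buffer.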
Two bytes of closed captioning data are sent with every frame of the video, so the time code at which to display a caption can be calculated from the frame in which it was sent (frames that contain no useful closed captioning will contain 0x8080 as a filler). There are two ways to compute the time code based on the frame: dropframe and non-dropframe. Non-dropframe is computed simply by dividing the frame number in which the closed caption instruction was sent by the number of frames per second; for NTSC, this is 29.97 frames per second, and for PAL, 25 frames per second. Dropframe, on the other hand, is used for most broadcast signals and attempts to account for the non-integer 29.97 value using a scheme similar to that employed for leap years (i.e. the earth actually spins 365.25 times in a year, so rather than worry about the quarter day each year, it’s simply rounded off and turned into a 366th day once every four years). Instead of computing the time code based on 29.97 frames per second, it’s computed based on 30 frames per second. This results in 18,000 frames in ten minutes, as opposed to using 29.97, which results in 17,982 frames in ten minutes, a difference of 18 frames every ten minutes. So, when computing timecodes with dropframe, the first two frames of each of the first nine minutes out of ten are ignored, thus eliminating the problem of having an extra 18 frames. Thus, if you use non-dropframe to decode a broadcast that used dropframe, by the end of each 10-minute cycle the timecodes could be off by a little more than half a second. For simplicity, I’ve decided this is an acceptable discrepancy and have coded my timecode computation method to use non-dropframe. Feel free to change it if this bothers you.
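If you do want to experiment with both schemes, the arithmetic can be sketched as follows. TimecodeSketch and its two methods are hypothetical names, not part of the library. NonDropFrame follows the division described above. DropFrameLabel uses the commonly cited drop-frame numbering, which I'm assuming here: frame labels 00 and 01 are skipped at the start of every minute except each tenth minute, and the frame field is separated by a semicolon.

```csharp
using System;

// Hypothetical helper illustrating the two timecode schemes; not the library's code.
public static class TimecodeSketch
{
    const double NtscFramesPerSecond = 29.97;

    // Non-dropframe as described above: frame number divided by frames per second.
    public static TimeSpan NonDropFrame(long frameNumber)
    {
        return TimeSpan.FromSeconds(frameNumber / NtscFramesPerSecond);
    }

    // Drop-frame label (hh:mm:ss;ff) for a frame number, counting at 30 frames
    // per second and skipping two frame labels in nine minutes out of every ten.
    public static string DropFrameLabel(long frameNumber)
    {
        const long framesPerTenMinutes = 17982;   // 10 * 60 * 30 - 18
        const long framesPerDroppedMinute = 1798; // 60 * 30 - 2

        long tenMinuteBlocks = frameNumber / framesPerTenMinutes;
        long remainder = frameNumber % framesPerTenMinutes;

        // 18 labels are skipped per complete ten-minute block, plus 2 for each
        // minute boundary already crossed inside the current block.
        long skipped = 18 * tenMinuteBlocks;
        if (remainder > 1) skipped += 2 * ((remainder - 2) / framesPerDroppedMinute);

        long label = frameNumber + skipped;
        long frames = label % 30;
        long seconds = (label / 30) % 60;
        long minutes = (label / 1800) % 60;
        long hours = label / 108000;
        return string.Format("{0:00}:{1:00}:{2:00};{3:00}", hours, minutes, seconds, frames);
    }
}
```

For example, frame 1799 is labeled 00:00:59;29, frame 1800 jumps to 00:01:00;02, and frame 17982 lands on 00:10:00;00, which is also where NonDropFrame(17982) puts it (600 seconds).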
To help with the speed issue I mentioned previously, by default the ClosedCaptionsParser caches a serialized version of the parsed ClosedCaptionCollection into an NTFS Alternate Data Stream associated with the DVR-MS file. After the captions have been parsed successfully once, by default any attempts to parse the captions in the future will first attempt to retrieve the cached captions from this stream. So, while this operation is very fast on all subsequent parsing operations, parsing a DVR-MS file the first time can cause significant wait time.
For me, one of the most interesting scenarios this capability presents is enhanced navigation. Most folks today with PVR capabilities are stuck in the television-watching mindset of fast-forward and rewind. But what about search? Search is huge! What if you could jump to a place in the video where a particular line was said? What if you want to show someone that really funny joke in the episode of Friends you recorded last night? Closed captions make that possible.
I’ve dubbed this first sample application I’ve implemented “Search and View,” and it does exactly what its name implies. The application hosts the Windows Media Player ActiveX control in order to play a DVR-MS file. When a file is selected to be played, the captions are parsed from the file and are displayed in a list box, allowing individual captions to be selected. When a caption is double-clicked, the video jumps to the location in the video where that caption was displayed to the screen, allowing you instant access to any spoken dialogue in the video. Moreover, there’s a search box on the form that provides very simple searching capabilities. You can enter a search term and have the list of captions narrowed to only those captions that include the search term. Obviously, there are a plethora of ways in which this application can be expanded upon and improved, but I happen to think it’s pretty darn cool as is. Thanks to Derek Del Conte for working with me to flesh out the idea for this sample.
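The narrowing step is simple enough to sketch. This is only an illustration of the idea, not the sample's actual code: CaptionEntry is a hypothetical stand-in for the library's ClosedCaption class (just the text and the display time), and the case-insensitive substring match is an assumption about what "very simple searching" means.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the library's ClosedCaption class.
public class CaptionEntry
{
    public readonly string Text;
    public readonly TimeSpan DisplayTimecode;

    public CaptionEntry(string text, TimeSpan displayTimecode)
    {
        Text = text;
        DisplayTimecode = displayTimecode;
    }
}

public static class CaptionSearchSketch
{
    // Returns the captions whose text contains the search term, ignoring case.
    // An empty or null term returns every caption, i.e. the unfiltered list.
    public static List<CaptionEntry> Filter(IEnumerable<CaptionEntry> captions, string term)
    {
        List<CaptionEntry> matches = new List<CaptionEntry>();
        foreach (CaptionEntry caption in captions)
        {
            if (string.IsNullOrEmpty(term) ||
                caption.Text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(caption);
            }
        }
        return matches;
    }
}
```

The filtered list would be what gets bound to the list box; on a double-click, the selected entry's DisplayTimecode gives the position to seek the player to.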
If you’re like me, you record many episodes of a few different shows, and sometimes it can be difficult finding the show you’re really interested in watching. What I really needed was a way to do an intelligent full-text search on all of the videos on my hard disk in order to narrow down my files to only those I’m interested in... oh wait, I already have that: Windows Desktop Search. In order to use Windows Desktop Search to allow me to search for recorded videos based on that funny dialogue I want to play for my fiance, I implemented an IFilter for DVR-MS files. This IFilter is written in managed code using COM interop to expose the necessary functionality to Windows Desktop Search so that it can index all of the closed captions contained in my recorded DVR-MS files. I can now simply type into the Desktop Search text box a phrase from a show I previously recorded, and voila, I’m instantly provided with the DVR-MS file I should view. Wow.
Note: I mentioned earlier that it can be very slow to do the initial captions parse for a DVR-MS file. The way it’s currently implemented, this can cause problems for Desktop Search, which expects the IFilters it uses to be timely in their responses. If an IFilter takes too long to process a file (some number of minutes), Desktop Search does the right thing and assumes the IFilter has hung, aborting the indexing for that file. Additionally, the way I currently parse the file, I do all of the parsing in one fell swoop, which precludes Desktop Search from throttling the indexing; a robust IFilter would handle this much better, but, well, this is sample code. The IFilter will run very quickly if the captions have already been cached from a previous parse. Also note that the Desktop Search team recently released their own sample for how to implement managed IFilters.
Saving Captions When Converting To Other Formats
In my Fun with DVR-MS article, I demonstrated how it’s possible to use DirectShow to convert from DVR-MS files to other media formats such as WMV and WMA. However, those samples did not preserve the closed captions. And how could they? After all, WMV and WMA files don’t contain streams for captions, right? They do something even better. Both WMV and WMA files allow you to use the WM/Lyrics_Synchronised metadata header to store a collection of strings, each of which is associated with a particular time in the video file at which the string should be displayed. Sound familiar? Synchronized lyrics are visible in two different ways in Windows Media Player. The first, and most obvious, is through the synchronized lyrics editor that’s part of the Advanced Tag Editor in Media Player. But these lyrics wouldn’t do any good if you couldn’t view them along with the media at the appropriate time. In fact, you can. If you enable captions/subtitles for a video (the option is available from the Play menu in Media Player), Media Player will show you the synchronized lyrics along with the video at the appropriate time. So, I’ve augmented the ConvertToWmv and ConvertToWma applications I provided in the Fun with DVR-MS article to extract the closed captions from the original DVR-MS file and to save them as synchronized lyrics into the metadata headers for the generated Windows Media files.
Summarizing Video Files
This one is admittedly a bit far-fetched, but I still think the idea is neat and wanted to see how it would fare. Microsoft Word has the ability to provide automated summaries for documents. You specify how much the text in the document should be summarized, Word analyzes the textual content, and it provides a new document that is some percentage in size of the original text (25% by default, I believe). Wouldn’t it be neat if we could do the same thing for video files? I decided it’d be fun to use Word’s AutoSummarize feature to implement this for videos. First, I extract the captioning from a DVR-MS file. I then programmatically dump that textual content into a Word document and ask Word to summarize it for me. The summarized text is then mapped back to the original captions (in a slightly haphazard fashion, I admit) as parsed from the file in order to determine which captions should be kept as part of the summary. The RecComp class, as described in the Fun with DVR-MS article, is then used to create a video summary including the segments that contain the summary captions. Useful? Unsure. Cool? Yup, or at least I think so.
At over 3500 words, this post didn’t end up being as short as I’d planned, but hey, the more the better I guess. There are so many neat things you can do with captions, I’m sure this only scratches the surface. Some additional ideas for things I’d implement if I had the time:
I’d love to hear what else you come up with and any feedback you might have (again, though, this is all unsupported, so while I’ll try to help where and when I can, I make no guarantees about anything). In the meantime, I hope this is helpful.