Editor's Note: Today's post comes from Dmitry Vavilov, Head of Software Development and Consulting Department in LegalSoftWave. Dmitry addresses a very interesting topic – user behavior prediction for content delivery systems, such as TV systems, news applications or any other platforms with broad audience and rapidly changing content.

Vladimir Kolesnikov

Prediction of viewer behavior is studied in frames of software development for TV platforms. Last year the author suggested a simple algorithm for TV content choice (based on analysis of user activities cyclicality and industry specifics). Possible application of this approach to other areas was also proposed. However, the generalized methodology requires consideration of more factors. Content description should be used for the prediction if it is available. The improved algorithm taking into account Electronic Program Guide (EPG) data is described in this article.

  1. Introduction

    Considering perspectives of applications for open TV platforms on SECR 2012, the author conducted a research on simple algorithms for recommendations based on analysis of user behavior and industry specifics [1]. It included studying of cyclicality of the content's broadcasting and viewing. Applying results of analysis to the simplest data on the viewer's activities allows 1) making powerful assumptions on the user's preferences and 2) directing him/her to the most interesting channel (it is very important from the targeted advertisement point of view). Finally, the author described a simple algorithm for the channel choice at the moment when the user would like to turn on the TV set. These tips based only on analysis of previous viewings. In other words, the viewer was not requested to choose channels from EPG (Electronic Program Guide) or answer any questions on preferences, etc.

    The author also discussed [2] possible application of this approach to other areas (including Consumer Electronics solutions, financial software, and enterprise content management systems). Making recommendations for users becomes a very common development task and even a standard component of the software usability.

    Complex data mining algorithms could not be utilized in many cases. Usually they provide high efficiency of recommendations but need comprehensive content description and a lot of records on the user's activities. Sometimes these algorithms are based on synergetic analysis of information gathered from thousands of users. It requires powerful infrastructure maintenance (return channel support, servers for data storage and processing). Often huge data gathering could not technologically possible (the number of users is not so big; return channels may not exist, etc.) Moreover, development and support of complex solutions costs a lot, and high efficiency of tips could not compensate appropriate investments.

    So, the effective enough, cheap, flexible, and easy implemented algorithms are required. They will become the foundation of generalized methodology for simple implementation of user interface self-adjustment and content navigation. However, it requires consideration of more inputs if they are available.

    Content description is one of most important factors. For TV this information is delivered via EPG. It usually includes time frames for each program, its title and some other fields (genre, etc.) The improved algorithm taking into account EPG data is described in details in this article. The subject area (TV industry) allows implementation of the simplest solution (comparing, for example, with web-browsing). But it is convenient to demonstrate the common algorithm steps on this specific task.

  2. Task Definition and Assumptions

    TV viewer's activities are logged by STB (Set Top Box) software in the following format: the channel, the viewing's start time and duration. In frames of discussed below assumptions, it is necessary to save records only for N last days (e.g., 10 days log is considered below).

    The number of channels is limited and the list of channels is predefined and static (it is a fundamental difference from the tasks like web-browsing, etc.; the algorithms adjusted for unlimited choice will be discussed in another article). To be definite, 5 channels are broadcasted.

    If EPG information is available, the correspondence between the TV viewer activity and its description can be retrieved both for previous and current time periods. In practice, one viewer's activity could correspond to more than one TV show (somebody starts to watch Channel 3 at 8 PM and does it 3 hours successively including 5 programs - news, series, etc.); in such case the STB software shall divide appropriate record into several ones (five records in this case).

    For definiteness, EPG information includes time frames for each program, its title and genre (the classifier with limited and pre-defined number of options – sport, news, comedy, etc.).

    It is important that the algorithm can be used both for cases with and without EPG data as well as for different number of available classifiers.

    The algorithm shall provide recommendation on the TV channel choice when the viewer turns on the TV set. The recommendations are based only on analysis of previous viewings (and EPG information when available). In other words, the viewer is not expected to choose the channel or answer any questions on preferences, etc. (it is not sometimes effective to ask the user on the preferences explicitly because it might not be convenient for the user, return channel may not be supported, and adequacy of the answers is not always good enough).

    It should be also mentioned that the described algorithm is flexible enough to provide recommendations for any moment (not only the moment of turning on).

    The following implicit assumptions are based on understanding of the industry specifics and user behavior:

    1)    Cyclicality of the content provider's the user's activities. For example, strong regular programming approach is implemented by the majority of TV channels. The programs are started at the same time and repeated regularly (daily, etc.)

    2)    Correlation between already seen program and the intention to watch the program again is higher if the time interval in days between them is less. In other words, if yesterday the users watched TV 1 channel at 7 PM and the new episode of series was shown, the probability to watch the same channel at the same time is higher in several days than in several months (so it does not make sense saving records on the viewer's activities for more than 90 days, etc.)

    3)    Duration of the user's activity (watching the show, etc.) could be automatically registered and used further as quantitative and qualitative estimation of the viewer's preferences.

    4)    Time of the start of the user's activity is correlated with the group of users (it is especially important when the user's identification is impossible). In other words, it is expected that the same viewer (or group of viewers) watches TV at 7PM (the case when identification is provided and the user profile description is available will be considered in another publication).

  3. Simple Effective Algorithm

    1. Duration and Start Time Info Usage

    So, the STB supports the log of the TV channels viewings for some time (for 10 days in the example). When the user "turns on" the TV set (for example, at 18:53 on November the 11th), the STB defines what channels (and how long) were watched at the same time previously. For example, the following list of such records exists (as defined above, the separate record is created for each TV program when EPG info is available):

    Day

    Viewing

    Channel

    Start time (hours: min)

    Duration (min)

    10 Nov

    1

    18:30

    30

    09 Nov 

    3

    17:55 

    60

    08 Nov 

    1 

    18:50 

    10 

    07 Nov 

    1 

    18:30 

    30

    06 Nov 

    2

    18:48 

    18 

    05 Nov 

    5 

    18:53

    0,5

    04 Nov 

    1 

    18:20 

    45 

    03 Nov 

    2

    18:38 

    24

    01 Nov 

    1 

    18:28 

    30

    1. Log of viewings

    It is noticed that sometimes the viewer could not watch TV at 18:53 (on 2 November).

    In the second step, the rating is calculated for each of the chosen programs. The rating consists of two components that are multiplied.

    The first depends on the viewing duration. Non-linier dependencies are applied here (for this example, 0.001 for the programs watched less than for a minute; 0.01 for duration from 1 minute to 5 minutes, 0.1 for duration from 5 minutes to 15 minutes, 1 for duration more than 0.5 hour).

    Another component depends on the time interval from the day of the viewing till the current day. For a program watched yesterday, this coefficient is equal to 1; for the programs watched 10 days ago, it is equal to 0.01, and so on (non-linier dependency is also used here).

    For example, if squared relationships are used for both components, the following values could be calculated in this case:

    Day

    Rating of viewings

    Channel

    First component

    Second component

    Rating (product)

    10 Nov

    1

    1 

    1  

    1

    09 Nov 

    3

    1 

    0,96 

    0,96

    08 Nov 

    1 

    0,11 

    0,91 

    0,1

    07 Nov 

    1 

    1 

    0,84 

    0,84

    06 Nov 

    2

    0,36 

    0,75 

    0,27

    05 Nov 

    5 

    0,001 

    0,64 

    0,001

    04 Nov 

    1 

    1 

    0,51 

    0,51

    03 Nov 

    2

    0,64 

    0,16 

    0,10

    01 Nov 

    1 

    1 

    0,01 

    0,01

    1. Ratings of viewings

    Then the ratings for the same channels are summarized. It should be reminded, that this set of ratings corresponds only to the specific time (18:53).

    Channel

    Summarized Ratings

    1

    2,46 

    2 

    0,37

    3 

    0,96

    5 

    0,001 

    1. Summarized ratings of viewings

    It is recommended to assign very little (bot non-zero) summarized ratings for the other available channels even if they were not shown (as it is done for channel 4 below). It is necessary due to final step specifics (not to exclude completely these channels from the recommendations).

    Channel

    Summarized Ratings

    Probability 

    1

    2,46 

    0,65 

    2 

    0,37 

    0,097 

    3 

    0,96 

    0.25 

    5 

    0,001 

    0,0003 

    4 

    0,00001 

    0,000003 

    1. Summarized ratings and probabilities

    Finally random choice of the recommended channel is done taking into account these ratings (for example, if the summarized rating of Channel 1 is equal to 2,46 and the summarized rating of Channel 3 is equal to 0,96, the probability to choose the Channel 1 is 2,5 times higher).

    It should be mentioned that the algorithm provides non-static sets of ratings (if the user becomes to watch new series on another channel, it should be quickly reflected due to calculations on the second step).

    It is obvious that this approach completely ignores any information on content. It shows only relationships to the viewed channels but strong regular programming policy of TV broadcasters provides some implied correlation between the time of viewing and the specific TV program (allowing to level lack of EPG information).

    1. EPG Data Usage

    In case of the content description availability, the algorithm should be improved as described below. As defined previously, EPG information includes time frames for each program, its title and genre (the classifier with 3 options – sport S, news N, movies M; one of these values shall be assigned to each program). It is necessary to emphasize that EPG information is expected to be available both for already seen and future TV programs.

    Due to assumption on the users identification impossibility, only genre preferences for the specific moment is considered below.

    For this example the watched programs had the following genres (the time and channel have some correlation with genre):

    Day

    Viewing

    Channel

    Start time (hours: min)

    Genre

    10 Nov

    1

    18:30 

    N

    09 Nov 

    3 

    17:55 

    S

    08 Nov 

    1 

    18:50 

    N

    07 Nov 

    1 

    18:30 

    N

    06 Nov 

    2 

    18:48 

    M

    05 Nov 

    5 

    18:53 

    M

    04 Nov 

    1 

    18:20 

    N

    03 Nov 

    2 

    18:38 

    S

    01 Nov 

    1 

    18:28 

    N

    1. Correlation of the viewings genre with channel and start time

    It is noticed that the viewer did not watch program of some genre at 18:53 (movies in this case)

    Correspondence between the viewings ratings (last column in Table 2) and appropriate genre is also used in further calculations of conditional probability to choose a program of specific genre:

    Day 

    Channel

    Rating

    Genre

    10 Nov

    1

    1 

    N  

    09 Nov 

    3 

    0,96 

    N

    08 Nov 

    1 

    0,1 

    N 

    07 Nov 

    1 

    0,84 

    N 

    06 Nov 

    2 

    0,27 

    S

    05 Nov 

    5 

    0,001 

    S

    04 Nov 

    1 

    0,51 

    N 

    03 Nov 

    2 

    0,10

    S 

    01 Nov 

    1 

    0,01 

    N 

    1. Correspondence of the viewings genre and ratings

    Then ratings for the same genres are summarized:

    Genre

    Summarized Ratings

    Probability 

    N

    3,42

    0,8997

    S

    0,38

    0,1

    M

    0.001

    0.0003

    1. Summarized ratings and probabilities relating genres

    Again it is needed to assign very little (but non-zero) summarized ratings for the other available genres (as it is done for movies).

    When the user "turns on" the TV set (at 18:53 on November the 11th), the following programs are going on (EPG provides their description):

    Channel

    Genre

    1

    N 

    2 

    S 

    3 

    N 

    4 

    S 

    5 

    M 

    1. Genre of TV programs broadcasted at specific time (18:53)

    Applying this EPG information on currently broadcasted programs (Table 8) and knowledge on the viewer preferences on genre (Table 7) to summarized ratings, the probabilities are finally calculated for these five channels:

    Channel

    Summarized Ratings

    Genre weight 

    Final Summarized Ratings (product)

    Final Probability 

    1

    2,46 

    0,8997 

    2,21 

    0,71

    2 

    0,37 

    0,1 

    0,037 

    0,012

    3 

    0,96 

    0,8997 

    0,86 

    0.277

    4 

    0,00001 

    0,1 

    0,000001 

    0,0000003

    5 

    0,001 

    0.0003 

    0,000003 

    0,0000007

    1. Summarized ratings and probabilities

    The other classifiers could be taken into account the same way.

    1. Monetization Opportunities

    High return on investments to development of such algorithms is expected due to low cost of their implementation and good enough improvement of the system effectiveness. Of course, monetization of the algorithm is also possible based on advertisement targeting (it is provided by default - the user is directed to the channel he/she is interested in). At the same time another opportunity is available.

    TV operator may increase probability of choice for the specific channel. It is recommended to do it by multiplying of the final summarized rating (the 4th column in Table 9). It is important to do it on the last step because it helps to exclude choice of TV program with wrong genre (it should not affect the genre "weight" calculation). So it practically prevents the situation when the user never watches sports but the TV set is directed to a football game. The advertised channel may have some preferences only in case of correspondence of genre of its programs with previously chosen ones.

  4. Conclusions

    Making recommendations for the user becomes a very common task for the software developers. It could be considered as a standard component of the system usability. So the developers shall analyze the specific user's activities to define his/her preferences and make recommendations on access to the most interesting content. With regard to TV industry, it means direction to one of available channels.

    It requires an effective enough, cheap, flexible, and easy implemented algorithms like the described one. It is based on analysis of the user activities cyclicality and industry specifics. Content description consideration improves the prediction if EPG data is available (however it is not necessary). The instantaneous tip to watch one of preferred programs is provided implicitly (no actions from the user are required). The approach is partly founded on the listed above assumptions on user behavior and TV industry practice.

    To formulate a generalized methodology for simple implementation of user interface self-adjustment and content navigation, it is necessary to take into account more possible factors including: 1) Unlimited choice of content; 2) Identification support; 3) Analysis of recommendation effectiveness and its usage for the algorithm self-adjustment; 4) Analysis of the other fields in the content description like title or summary. It will be done in future publications.

  5. ACKNOWLEDGMENT

    The authors would like to acknowledge the contribution of Prof. Konstantin Glasman (Head of Video Systems Department at St.Petersburg State University of Film and Television), Vladimir Kolesnikov (Microsoft Technical Evangelist), Kirill Kostyushkin (CEO, Adakta), and Ivan Platonov (SPbGU Politech).