A Sequential Topic Model for Mining Recurrent Activities from Long Term Video Logs

Jagannadan Varadarajan, Rémi Emonet, Jean-Marc Odobez

This page is the entry point for the additional material associated with the submitted paper "Sequential Topic Models for Mining Recurrent Activities from Audio and Video Data Logs". The contents of this material are as follows:

Sample temporal documents

Given videos recorded from static cameras, we generate temporal documents by first applying a PLSA step to 1-second documents made of bags of words of low-level motion features. The estimated weights of the recovered topics at each time instant, weighted by the number of words, are used to build the input documents of our algorithm. Two sample documents are given below.
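The construction above can be sketched as follows. This is a minimal illustration only: the array names, the topic count, and the per-second word counts are hypothetical placeholders, and the PLSA topic weights (normally the output of a PLSA fit on the 1-second bag-of-words documents) are simulated with random draws.

```python
import numpy as np

# Hypothetical sizes: n_topics low-level PLSA topics, T one-second time steps.
n_topics, T = 25, 300

# p_z_given_d[z, t]: PLSA topic weights p(z|d_t) for the 1-second document at
# time t. Each column sums to 1 (here simulated with Dirichlet draws; in the
# real pipeline these come from the PLSA step on motion-word counts).
p_z_given_d = np.random.dirichlet(np.ones(n_topics), size=T).T

# n_words[t]: number of low-level motion words observed during second t.
n_words = np.random.randint(50, 200, size=T)

# Temporal document: each topic's weight scaled by the word count at that
# instant, giving a (topics x time) count-like matrix used as PLSM input.
temporal_doc = p_z_given_d * n_words[np.newaxis, :]

print(temporal_doc.shape)  # (25, 300)
```

Scaling by the word count preserves the overall amount of evidence at each instant: the column sums of the temporal document equal the per-second word counts.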

tdoc for rue
Temporal document of 300 seconds (5 minutes) from the “Far-field” video (see Fig. 1(b) in the paper). This scene is not controlled by traffic lights, so there are no particular cycles.
tdoc for mit
Temporal document of 300 seconds (5 minutes) from the “MIT” video (see Fig. 1(a) in the paper). This scene is controlled by traffic lights, and we can observe some periodic activities in the temporal document.

Different representations of a motif

Each motif extracted by our algorithm can be interpreted as the probabilities of occurrence of each word at each relative time step. For video data, given the way our documents were constructed, we proposed several ways of representing the motifs; four of them are illustrated below.
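Concretely, a motif can be stored as a 2D table over words and relative time steps that sums to one, i.e. a joint distribution p(w, Δt | z). The sketch below is illustrative only: the vocabulary size, motif length, and random values are placeholders, not from the paper.

```python
import numpy as np

n_vocab, motif_len = 25, 10  # illustrative: vocabulary size, relative time steps

# A motif: joint probability of (word, relative time step), normalized to 1.
rng = np.random.default_rng(0)
motif = rng.random((n_vocab, motif_len))
motif /= motif.sum()

# Probability that word w occurs at relative time step dt within the motif:
w, dt = 3, 5
print(motif[w, dt])

# Marginalizing over relative time gives the overall word distribution p(w|z):
p_w = motif.sum(axis=1)
```

The matrix-image representation below (scheme 2) is simply a direct rendering of this table, with darker pixels for higher probabilities.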

Representation scheme 1: Each time instant is overlaid in a separate image. The 10 images above show the activity occurring at each time instant of the motif.



Representation scheme 2: This is the generic representation. The motif is displayed as a matrix image where white points represent zero probability and darker points represent higher probabilities.

Representation scheme 3: For video data, each word corresponds to a region of activity in the image. We can create a small animation by displaying, at each time step, the backprojection of the words present at that time step in the motif.

Representation scheme 4: We can create a static image by merging all time steps of the animation into a single image (as used in the paper).

The color palette we use to represent the evolution of time ranges from violet to red. The beginning of the motif is blue or violet (depending on whether the first time step is empty or not), and the end is red. Red always represents the activity/words occurring at the maximal end of the motif, whatever the duration of the motif actually learned. Thus, if we search for motifs of 20 seconds and the learned motif actually lasts only 10 seconds, we will not see red activities in the animation (nor intermediate green or orange ones).
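The violet-to-red time colormap described above can be sketched as follows. This is a minimal illustration using HSV hues (violet at hue 0.75 down to red at hue 0.0); the exact palette used to render the figures on this page may differ.

```python
import colorsys

def time_color(t, motif_len):
    """Map a relative time step to an RGB color on a violet-to-red scale:
    t = 0 -> violet (hue 0.75), t = motif_len - 1 -> red (hue 0.0)."""
    hue = 0.75 * (1.0 - t / (motif_len - 1))
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

motif_len = 10
print(time_color(0, motif_len))              # (0.5, 0.0, 1.0) violet
print(time_color(motif_len - 1, motif_len))  # (1.0, 0.0, 0.0) red
```

Because the hue is computed against the maximal motif length searched for, a learned motif shorter than that length never reaches the red end of the scale, exactly as described above.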

Motifs from Far-field data

The Far-field scene (see Fig. 1(b) in the paper) is footage of a three-road junction captured from a distance, where the typical activities are moving vehicles. As the scene is not controlled by a traffic signal, activities occur at random times. The video lasts 108 minutes, recorded at 25 fps with a 280x360 frame resolution. When the motif length was set to 10 seconds, we obtained 20 motifs using the BIC measure. Here are the 20 motifs presented in both the gif and collapsed representations.

Note: In all the examples below, the motifs can be sorted based on their p(z) values by checking the button on the top left of the page. Motif number appears when the mouse is moved over the gif/image.

Motifs of 10s Duration

Motifs of 20s Duration

10s Motifs: We can observe that a few motifs resemble each other closely and represent the same activity. For example, motif (7) and motif (9) from the 10s motifs represent activities of vehicles going from bottom right to top right, and similarly, motif (5) and motif (6) represent vehicles taking a turn and approaching the bottom right of the scene. These are motifs that could potentially be merged. Interestingly, we also see that PLSM captures the noise arising from low-level feature extraction (background subtraction and optical-flow estimation) in a separate motif, but with a much smaller weight p(z) = 0.008, since such noise does not appear often.

Motifs from MIT data

The MIT scene is video footage of a two-lane, four-road junction captured from a distance, where there are complex interactions among vehicles arriving from different directions, and a few pedestrians crossing the road. This dataset contains 90 minutes of video, recorded at 30 frames per second (fps) with a resolution of 480x756, which was downsampled to half its size.

Motifs of 10s Duration

Motifs of 20s Duration

10s Motifs: The extracted motifs represent both static activities and motion activities very well. Motifs (0, 2, 12, 20, 24) are examples of static activities, such as vehicles waiting for the signal. The movement of trees in the wind also creates words, resulting in motif (9); note that words over all the trees in the scene co-occur. The remaining motifs, such as motifs (3, 4, 5, 13), are mainly due to the movement of vehicles in different directions.

Motifs from Traffic junction data

The Traffic Junction data (see Fig. 1(c) in the paper) captures a portion of a busy traffic-light-controlled road junction. In addition to vehicles moving in and out of the scene, activities also include people walking on the pavement or waiting before crossing at the zebra crossings. The video, recorded at 25 fps with a 280x360 frame resolution, lasts 44 minutes.

Motifs of 10s Duration

Motifs of 20s Duration

10s Motifs: PLSM has extracted meaningful motifs even though the training data is relatively small. Motifs (1, 3, 4, 5, 6) represent vehicular activities. Note the subtle difference between motif (4) and motif (6): in motif (4) vehicles take a turn, whereas in motif (6) they go straight. We can also observe pedestrian activities in motifs (2, 8, 11, 13, 15). Motif (15) captures people criss-crossing the road, which is visualized quite well in the gifs.

Motifs from Metro station data

The Metro station data (see Fig. 1(d) in the paper) is footage of a crowded hallway in one of the metro stations in Rome. The video is 2 hours long, recorded at 5 fps with a 576x720 frame resolution. The activities are caused only by people and are therefore highly unstructured. When the motif length was set to 10 seconds, we obtained 25 motifs using the BIC measure. Here are the 25 motifs presented in both the gif and collapsed representations. These motifs are extracted from optical-flow features only, without background subtraction.


Motifs of 10s Duration

Motifs of 20s Duration

10s Motifs: Even though the scene is extremely crowded, we could extract meaningful activities as dominant patterns. For example, the 10-second motifs represent activities such as people walking from the lower right, middle, and left parts of the scene to the top. We can also see activities like people moving towards the turnstiles from different directions, crossing them, and reaching the platform. Since the activities here are usually short, the 20-second motifs are not significantly different. However, a motif like the 9th one represents a complete activity of two groups of people coming towards the turnstiles, crossing them, and reaching the platform, one group following the other.



End