Abstract
Wearable cameras make it possible to record life-logging egocentric videos. Browsing such long, unstructured
videos is time-consuming and tedious. Segmentation into meaningful chapters is an important first step towards adding
structure to egocentric videos, enabling efficient browsing, indexing, and summarization of these long videos. Two sources
of information for video segmentation are (i) the motion of the camera wearer, and (ii) the objects and activities
recorded in the video. In these works, we address the motion cues for video segmentation.
In our most recent paper,
Compact CNN for Indexing Egocentric Videos, we present a compact 3D Convolutional Neural
Network (CNN) architecture for long-term activity recognition in egocentric videos. Recognizing long-term activities
enables us to temporally segment (index) long, unstructured egocentric videos. Given a sparse optical flow volume as input,
our CNN classifies the camera wearer's activity. We obtain a classification accuracy of 89%, outperforming our previous work
by 19%. Additional evaluation is performed on an extended
egocentric video dataset,
classifying twice as many categories as our previous work. Furthermore, our CNN is able to recognize whether a video
is egocentric or not with 99.2% accuracy, a 24% improvement over the current state of the art. To better understand what the network actually
learns, we propose a novel visualization of CNN kernels as flow fields.
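To make the CNN's input concrete, the following is a minimal sketch of building a sparse optical flow volume from a sequence of dense flow fields. The function name, grid size, and frame count are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sparse_flow_volume(flows, grid=(10, 5)):
    """Sample each dense flow field of shape (H, W, 2) on a coarse grid
    and stack the sampled u/v components into a (2, T, gh, gw) volume,
    the kind of spatio-temporal input a 3D CNN can consume."""
    gh, gw = grid
    T = len(flows)
    volume = np.zeros((2, T, gh, gw), dtype=np.float32)
    for t, flow in enumerate(flows):
        H, W, _ = flow.shape
        ys = np.linspace(0, H - 1, gh).astype(int)
        xs = np.linspace(0, W - 1, gw).astype(int)
        sampled = flow[np.ix_(ys, xs)]   # (gh, gw, 2) coarse samples
        volume[0, t] = sampled[..., 0]   # horizontal (u) component
        volume[1, t] = sampled[..., 1]   # vertical (v) component
    return volume

# Toy usage: 8 frames of constant rightward flow on a 60x80 image.
flows = [np.tile(np.array([1.0, 0.0], dtype=np.float32), (60, 80, 1))
         for _ in range(8)]
vol = sparse_flow_volume(flows)
print(vol.shape)  # (2, 8, 10, 5)
```

Subsampling the flow this way keeps the input compact, which is what allows the network itself to stay small.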
In our original work,
Temporal Segmentation of Egocentric Videos, we propose a robust temporal segmentation of egocentric
videos into a hierarchy of motion classes using novel
Cumulative Displacement Curves. Unlike instantaneous motion vectors,
segmentation using integrated motion vectors performs well even in dynamic and crowded scenes. No assumptions are made on the
underlying scene structure, and the algorithm works in indoor as well as outdoor situations. We demonstrate the effectiveness of
our approach using publicly available videos as well as choreographed videos. We also suggest an approach to detect the fixation
of the wearer's gaze in the walking portions of egocentric videos.
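A minimal sketch of the Cumulative Displacement Curve idea: instead of relying on noisy instantaneous flow, integrate the horizontal displacement of each image region over time. During forward walking, the focus of expansion makes the curves diverge (left regions drift left, right regions drift right), while standing still keeps them flat. The column count and toy values below are illustrative assumptions.

```python
import numpy as np

def cumulative_displacement_curves(flow_u):
    """flow_u: (T, C) mean horizontal flow per image column per frame.
    Returns the (T, C) curves of time-integrated displacement."""
    return np.cumsum(flow_u, axis=0)

# Toy example: 5 frames, 4 columns. Walking-like flow diverges from the
# image center; integration makes the divergence easy to segment on.
flow_u = np.tile(np.array([-2.0, -1.0, 1.0, 2.0]), (5, 1))
curves = cumulative_displacement_curves(flow_u)
print(curves[-1])  # [-10.  -5.   5.  10.]
```

Because integration averages out per-frame noise from moving objects, the curves remain stable even in the dynamic and crowded scenes mentioned above.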