V3C - a Research Video Collection

With the widespread use of smartphones as recording devices and the massive growth in available bandwidth, the number and volume of video collections have increased significantly in recent years. This poses novel challenges for the management of such large-scale video data and especially for the analysis of and retrieval from such video collections. At the same time, existing video datasets used for research and experimentation are either not large enough to represent current collections or do not reflect the properties of video commonly found on the Internet in terms of content, length, or resolution. In this paper, we introduce the Vimeo Creative Commons Collection, V3C for short, a collection of 28'450 videos (with an overall length of about 3'800 hours) published under a Creative Commons license on Vimeo. V3C comes with a shot segmentation for each video, the resulting keyframes in original as well as reduced resolution, and additional metadata. It is intended to be used from 2019 onward at the International large-scale TREC Video Retrieval Evaluation campaign (TRECVid).


Introduction
Over recent years, video has come to constitute a significant portion of the overall data on the web. This is largely because the production and distribution of video has shifted from a complex and costly endeavor to something accessible to anybody with a smartphone or similar device and an Internet connection. This growth in content has opened new possibilities in various research areas which are able to make use of it. Despite the access to such large amounts of data, there remains a need for standardized datasets for computer vision and multimedia tasks. Multiple such datasets have been proposed over the years. A prominent example of a video dataset is the IACC [5], which has been used for several years for international evaluation campaigns such as TRECVid [2]. Other examples of datasets in the video context include the YFCC100M [8] which, despite being sourced from the photo-sharing platform Flickr, contains a considerable amount of video material; the Movie Memorability Database [4], which comprises memorable sequences from 100 Hollywood-quality movies; and the YouTube-8M [1] dataset which, despite being sourced from YouTube, does not contain the original videos themselves. The content of all of these collections does, however, differ substantially from the type of web video commonly found 'in the wild' [7].
In this paper, we present the Vimeo Creative Commons Collection, or V3C for short. It is composed of 28'450 videos collected from the video-sharing platform Vimeo. Apart from the videos themselves, the collection includes metadata and shot-segmentation data for each video, together with the resulting keyframes in original as well as reduced resolution. The objective of V3C is to eventually complement or even replace existing collections in real-world video retrieval evaluation campaigns and thus to tailor the latter more closely to the type of video that can be found on the Internet.
The remainder of this paper is structured as follows: Section 2 gives an overview of the process of how the collection was assembled and Section 3 introduces the collection itself, its structure and some of its properties. Finally, Section 4 concludes.

Collection Process
The requirements for usable video sources from which to compile a collection were as follows:
- The platform must be freely accessible.
- It must host a large amount of diverse and contemporary video content.
- At least a portion of the content must be published under a Creative Commons license and may therefore be redistributed in such a collection.
Two candidate platforms were Vimeo and YouTube. Vimeo was chosen over YouTube because, while YouTube offers its users the possibility to publish videos under a Creative Commons attribution license, which would allow the reuse and redistribution of the video material, YouTube's Terms of Service [9] explicitly forbid the download of any video on the platform for any reason other than playback in the context of a video stream.
We utilized Vimeo's categorization system for video collection. Videos are placed in 16 broad top-level categories, which are further divided into subcategories. Videos in each category were examined to determine whether they satisfied the 'real world' requirements for the collection. Four top-level categories were included in their entirety, three were excluded entirely, and for the remaining nine, only some subcategories were included. The four fully included categories are 'Personal', 'Documentary', 'Sports' and 'Travel'.
An overview of the excluded categories can be seen in Figure 1. Categories with very low visual diversity (such as 'Talks') or which did not represent real-world scenarios were removed, as were categories (or subcategories) dominated by animation and graphics, or containing non-standard content with little or no describable activity. Videos from the selected categories were then filtered by duration and license.
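The duration and license filter can be sketched as follows. This is a minimal illustration, not the actual pipeline: the metadata field names and the set of license identifiers are assumptions; the 3-60 minute window follows the collection's stated duration constraint.

```python
# Illustrative filter over per-video metadata records. Field names
# ("duration" in seconds, "license") and the license identifiers are
# assumptions; the 3-60 minute window matches the collection's constraint.

CC_LICENSES = {"by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd", "cc0"}

def is_candidate(video: dict) -> bool:
    """Keep videos between 3 and 60 minutes under a Creative Commons license."""
    minutes = video.get("duration", 0) / 60
    return 3 <= minutes <= 60 and video.get("license") in CC_LICENSES

videos = [
    {"id": 1, "duration": 600, "license": "by"},                    # 10 min, CC-BY
    {"id": 2, "duration": 90, "license": "by"},                     # too short
    {"id": 3, "duration": 1200, "license": "all-rights-reserved"},  # not CC
]
candidates = [v for v in videos if is_candidate(v)]
```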
The obtained list of candidate videos was downloaded from Vimeo using an open-source video download utility. The download was performed sequentially in order not to cause unnecessary load on the side of the platform. All downloaded videos were subsequently checked to ensure they could be properly decoded by a commonly used video decoding utility.
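Such a decodability check can be sketched as follows. The utilities used for the collection are not named here, so this illustration assumes an ffprobe-like command-line tool is available; it is a sketch, not the actual verification script.

```python
import subprocess

def decodes_cleanly(path: str) -> bool:
    """Return True if the file appears to decode without errors.

    Assumes an ffprobe-like tool is installed; which utility was
    actually used for the collection is not specified here.
    """
    try:
        result = subprocess.run(
            ["ffprobe", "-v", "error", path],
            capture_output=True, timeout=600,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # tool missing, file unreadable, or probe hung
    # with "-v error", decode problems go to stderr and yield a
    # non-zero exit code, so both are checked
    return result.returncode == 0 and not result.stderr
```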
The videos were segmented and analyzed using the open-source content-based video retrieval engine Cineast [6]. Videos whose distribution of segment lengths differed sufficiently from the collection mean were flagged for manual inspection, as this indicated either very low or very high visual diversity, as in the cases of mostly static frames or very noisy videos, respectively. During this step, videos were also checked to ensure that the collection contains no exact duplicates.
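One way to implement such a flagging rule is a simple z-score criterion on each video's mean segment length; the actual criterion and threshold used are not specified, so both are assumptions here.

```python
import statistics

def flag_outliers(mean_seg_len: dict, z_threshold: float = 2.0) -> list:
    """Flag videos whose mean segment length deviates strongly from the
    collection-wide mean (the z-score threshold is an assumption)."""
    values = list(mean_seg_len.values())
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [
        vid for vid, v in mean_seg_len.items()
        if abs(v - mu) / sigma > z_threshold
    ]

# Mostly ~5 s segments; one near-static video with ~300 s segments
lengths = {"v1": 4.8, "v2": 5.2, "v3": 5.0, "v4": 5.1, "v5": 4.9,
           "v6": 5.3, "v7": 4.7, "v8": 5.0, "v9": 5.2, "v10": 300.0}
flagged = flag_outliers(lengths)
```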
From the remaining videos, three subsets of increasing size were randomly selected. Sequential numerical ids were assigned to the selected videos such that the first id in the second partition is one larger than the last id in the first partition, and so on, in order to facilitate situations in which multiple partitions are used in conjunction.
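The numbering scheme can be sketched as follows; this is only an illustration of the contiguous numbering across partition boundaries, with hypothetical file names.

```python
def assign_ids(partitions: list, start: int = 1) -> dict:
    """Assign contiguous numeric ids across partition boundaries, so that
    partition n+1 continues exactly where partition n left off."""
    ids = {}
    next_id = start
    for part in partitions:
        for video in part:
            ids[video] = next_id
            next_id += 1
    return ids

# Hypothetical file names across three partitions
parts = [["a.mp4", "b.mp4"], ["c.mp4"], ["d.mp4", "e.mp4"]]
ids = assign_ids(parts)
```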

The Vimeo Creative Commons Collection
The following provides an overview of the structure as well as various technical and semantic properties of the Vimeo Creative Commons Collection.

Collection Structure
The collection consists of 28'450 videos with a duration between 3 and 60 minutes each and a total combined duration of slightly above 3'800 hours, divided into three partitions. Table 1 provides an overview of the three partitions. Similar to the IACC, the V3C also includes a master shot reference which segments every video into sequential, non-overlapping parts based on the visual content of the videos. For every one of these parts, a full-resolution representative keyframe as well as a thumbnail image of reduced resolution is provided. Additionally, there are metadata files containing both technical and semantic information for every video, also obtained from Vimeo.
Every video in the collection has been assigned a sequential numerical id. These ids are then used for all aspects of the collection. Figure 2 illustrates the directory structure which is used to organize the different aspects of the collection. This structure is identical for all three partitions. The info directory contains one JSON file per video which holds metadata obtained from Vimeo. This metadata contains both semantic information -such as video title, description and associated tags -as well as technical information including video duration, resolution, license and upload date. The msb directory contains, for each video, a file in tab-separated format which lists the temporal start and end positions of every automatically detected segment in the video. The keyframes and thumbnails directories each contain a subdirectory per video which holds one representative keyframe, respectively one reduced-resolution thumbnail, per segment.
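Reading such a segment file can be sketched as follows. This is an illustration only: the column layout (one start/end pair per line) is assumed from the description above, and any header handling is hypothetical.

```python
import csv
import io

def read_segments(tsv_text: str) -> list:
    """Parse a tab-separated master-shot-boundary file into (start, end)
    pairs. The column layout is an assumption based on the description
    above: temporal start and end position per detected segment."""
    segments = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and hypothetical comment headers
        start, end = float(row[0]), float(row[1])
        segments.append((start, end))
    return segments

# Hypothetical msb content for a video with three detected segments
sample = "0.0\t4.2\n4.2\t9.8\n9.8\t15.0\n"
segs = read_segments(sample)
```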

Statistical Properties
The following presents an overview of the distribution of selected properties throughout the collection. The age distribution of the videos in the entire collection, as determined by the upload date of each individual video, is illustrated in Figure 3. It is shown in comparison to the distribution originally presented in [7] for a large sample of Vimeo in general. The trace representing the V3C is less clean than the one for the Vimeo dataset due to the large difference in the number of data points. It can nevertheless be seen that both traces have a similar overall shape, at least for the parts of the plot where data is available for both. Unlike the Vimeo dataset from [7], the collection of which was completed in mid-2016, the V3C includes videos from as late as early 2018, which explains the difference in shape towards the right side of the plot.
The distributions of video duration and resolution are shown in Figures 4 and 5, respectively, again in comparison to the larger Vimeo distributions. It can be seen that wherever there were no additional restrictions, the properties of the V3C follow those of the overall Vimeo dataset rather closely. At least in terms of these three properties, the V3C can therefore be considered reasonably representative of the type of web video generally found on Vimeo.
An overview of the languages detected by the same method as employed in [7], based on the titles and descriptions of the videos, can be seen in Table 2. It shows the top-10 languages for either the V3C or the dataset from [7]. The column labeled '?' represents the instances where language detection did not yield any result. It can be seen that for the videos whose titles and descriptions were distinct enough for language detection, the distribution within the V3C is similar to that of the Vimeo dataset. No language analysis based on the audio tracks of the videos has been performed yet. Table 3 shows the categories and the number of videos per collection partition which have been assigned to a particular category on Vimeo. Since every video can be assigned to multiple categories, the numbers shown in the table do not sum to the total number of videos. Although the categories have a structure which implies a hierarchy, a video can, but does not have to, be assigned to both a category and a subcategory. The large number of categories in use shown in the table implies a wide range of content to be found in the collection.

Possible Uses
Due to the large diversity of video content contained within the collection, it can be useful for video-related applications in multiple areas. The large number of different video resolutions -and to a lesser extent frame-rates -makes this dataset interesting for video transport and storage applications such as the development of novel encoding schemes, streaming mechanisms or error-correction techniques. Its large variety in visual content makes this dataset also interesting for various machine learning and computer vision applications.
Finally, the collection has applications in the area of video analysis, retrieval and exploration. We can, for example, imagine four application areas in the video retrieval space. First, video tagging or high-level feature detection: given a video segment or shot, a system should output all relevant tags and visual concepts occurring in it. Such a task is fundamental to any video search engine that tries to match user queries against a video dataset in order to retrieve the most relevant results. Second, ad-hoc video search, where a system takes as input a textual user query, formulated as a natural language sentence, and returns the set of videos that best satisfies the information need expressed in the query. This task is likewise essential for any search system that deals with real users, as it has to understand the user's query and intention before retrieving the matching results. Third, known-item search: trying to find a video or video segment which one believes to have seen but whose name one does not recall. Queries are created based on some knowledge of the collection such that there is a high probability that only one video or video segment satisfies the search. Fourth, video captioning or description, which has gained a lot of attention in recent years. Here, the goal is for a system to describe a video segment in textual form covering all the important facets, such as 'who', 'what', 'where' and 'when' -essentially a textual summary of the video. Since the V3C collection includes a master shot reference splitting each video into smaller shots, the video captioning task can be run on those short shots, as the current state of the art cannot yet produce a coherent, human-readable textual description for an entire longer video.

Availability
We plan to launch and make this collection available at the 2019 TRECVid video retrieval benchmark, where different research groups participate in one or more tracks. In addition, the collection will be shared with the interactive Video Browser Showdown (VBS) [3], which collaborates with TRECVid in organizing the Video Ad-hoc Search task. The collection will be available for download to the benchmark participants as well as the public. After the annual benchmark cycle is concluded, we will also provide the ground-truth judgments and queries/topics for the tasks that used the V3C collection, so that research groups can reuse the dataset in their local experiments and reproduce results.

Conclusions
In this paper, we introduced the Vimeo Creative Commons Collection (V3C). It is comprised of roughly 3'800 hours of creative commons video obtained from the web video platform Vimeo and is augmented with technical and semantic metadata as well as shot boundary information and accompanying keyframes. V3C is subdivided into three partitions with increasing length from roughly 1'000 hours up to 1'500 hours so that the collection can be used for at least three consecutive years in a video search benchmark with increasing complexity. Information on where to download the V3C collection and/or its partitions will be made available together with the publication of the video search benchmark challenges.