Introduction

With the development of large language models (LLMs) and multimodal alignment techniques, video understanding models have made significant progress in general open domains. However, most current video understanding models rely on frame averaging and video token compression, which discard temporal information and prevent the model from accurately answering time-related questions. On the other hand, models trained on temporal question-answering datasets are often restricted to specific formats and narrow domains, losing more general question-answering capabilities. In this paper, we propose an automated temporal grounding data construction method based on visual models and use it to generate 30k time-related video question-answering samples. Building on this new dataset together with existing open-domain question-answering data, we introduce multi-frame video images and timestamps as encoder inputs and train a new video understanding model, CogVLM2-Video. CogVLM2-Video not only achieves state-of-the-art performance on public video understanding benchmarks but also excels in video captioning and temporal grounding, providing a powerful tool for downstream tasks such as video generation and video summarization.

Model Architecture

Currently, the mainstream approach to video understanding extracts frames from videos with an image encoder and then applies a compression module (e.g., temporal pooling or a Q-Former) to condense the visual tokens before feeding them into a large language model (LLM) for joint understanding with the textual input. Although this effectively compresses the video information, it causes the model to lose temporal awareness: frames can no longer be associated with precise timestamps, so the model cannot perform temporal localization, timestamp detection, or key-moment summarization. In addition, video understanding models trained on existing temporal grounding annotations are limited by the scope of that data and its fixed question-answer format, and therefore lack open-domain question-answering and processing capabilities.

To address these issues, we propose CogVLM2-Video, a video model extended from the CogVLM2 image understanding model. It not only achieves state-of-the-art performance on open-domain question answering but also perceives timestamp information within videos, enabling temporal localization and related question answering. Specifically, we extract frames from the input video and annotate each frame with its timestamp, so that the language model knows exactly when each frame occurs in the original video. The figure below shows the overall architecture of our model:
[Figure: overall architecture of CogVLM2-Video]
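To make the timestamp mechanism concrete, here is a minimal sketch of timestamp-aware frame sampling, assuming OpenCV for decoding. The frame count and the textual format of the timestamps are illustrative assumptions, not the exact scheme used in CogVLM2-Video.

```python
# Minimal sketch: sample frames uniformly and record the second each one comes from.
# The frame count (24) and the "[frame at Xs]" text format are illustrative assumptions.
import cv2


def sample_frames_with_timestamps(video_path: str, num_frames: int = 24):
    """Uniformly sample frames and record the absolute time of each frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    frames, timestamps = [], []
    for i in range(num_frames):
        frame_idx = int(i * total / num_frames)
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)                # raw BGR image for the vision encoder
        timestamps.append(frame_idx / fps)  # absolute time in seconds
    cap.release()

    # Expose the timestamps to the language model as plain text, so each group of
    # visual tokens can be tied back to a point in the original video.
    time_prompt = " ".join(f"[frame at {t:.1f}s]" for t in timestamps)
    return frames, timestamps, time_prompt
```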

Temporal Grounding Q&A Datasets

As noted above, existing temporal grounding annotations are limited in scope and fixed in question-answer format, which leaves models trained on them without open-domain question-answering capability. Moreover, compared with the plain-text data used to train LLMs and the image understanding data used to train VLMs, high-quality video question-answering and temporal grounding data is extremely expensive to annotate, and manual annotation alone cannot meet the demands of large-scale training. To prepare temporal grounding data at scale, we developed a fully automated video question-answering data generation pipeline: we use the latest image understanding models to extract frame-level descriptions from video data, and then use large language models to filter the results and generate question-answer pairs. Through this automated workflow and large-scale training, CogVLM2-Video not only excels on public benchmarks but also gains the temporal question-answering capability that most previous video models lacked. The figure below shows the construction process, through which we ultimately generated 30k Temporal Grounding Question and Answer (TQA) samples:
[Figure: construction pipeline for the TQA dataset]
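The sketch below outlines the shape of this automated pipeline: caption each sampled frame with an image understanding model, then hand the time-stamped captions to an LLM that filters them and writes time-related question-answer pairs. `caption_image` and `generate_qa` are hypothetical placeholders standing in for the image model and the LLM, not APIs from the paper.

```python
# Schematic sketch of the automated TQA construction pipeline described above.
# `caption_image` and `generate_qa` are hypothetical placeholders for an image
# understanding model and an LLM, respectively.
from typing import Callable, Dict, List


def build_tqa_samples(
    frames: List,                                   # decoded video frames
    timestamps: List[float],                        # one timestamp (seconds) per frame
    caption_image: Callable[[object], str],         # frame -> caption
    generate_qa: Callable[[str], List[Dict[str, str]]],  # captions -> QA pairs
) -> List[Dict[str, str]]:
    """Caption each frame, then let an LLM turn the timed captions into QA pairs."""
    # Step 1: frame-level understanding with an image model.
    timed_captions = [
        f"{t:.1f}s: {caption_image(frame)}" for frame, t in zip(frames, timestamps)
    ]

    # Step 2: pass the time-stamped captions to an LLM that filters low-quality
    # clips and generates time-related QA pairs, e.g.
    # {"question": "When does the dog jump into the pool?", "answer": "Around 12.0s."}
    return generate_qa("\n".join(timed_captions))
```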

Evaluation

Evaluation results on VideoChatGPT-Bench and zero-shot QA benchmarks:
Evaluation results on MVBench:

Citation

@article{hong2024cogvlm2,
  title={CogVLM2: Visual Language Models for Image and Video Understanding},
  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
  journal={arXiv preprint arXiv:2408.16500},
  year={2024}
}