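Another Strong Release from the ShareGPT4V Team! Millions of High-Quality Video-Caption Pairs Help the Community Improve Video Understanding and Generation in Large Multimodal Models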

輕薄滴假象

Published 2024-6-20 15:29


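Since the announcement of Sora, applications of large multimodal models to video generation have surged. Video generation models such as LUMA and Gen-3 Alpha show high-quality artistic styles and finely detailed scenes, and the steadily advancing frontier of text-to-video and image-to-video generation has drawn both excitement and high expectations.

Recently, researchers from the University of Science and Technology of China, Peking University, Shanghai AI Lab, and other institutions released the ShareGPT4Video series, which aims to improve both video understanding and video generation.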


Paper: https://arxiv.org/abs/2406.04325v1
Project page: https://sharegpt4video.github.io/
Dataset: https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video
Code: https://github.com/ShareGPT4Omni/ShareGPT4Video
Demo: https://huggingface.co/spaces/Lin-Chen/ShareCaptioner-Video

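Over the past half year, following the release of ShareGPT4V's high-quality image-caption dataset, the image-language multimodal community has come to recognize how important detailed, accurate image-caption data is for aligning the image and language modalities. Since its release, the ShareGPT4V dataset has received the second-highest number of likes ever on the VQA dataset track of the HuggingFace platform.

Building on the high-quality ShareGPT4V dataset, the image understanding and image generation communities have also made notable progress, for example with InternVL-Chat-V1.5 and PixArt-Σ.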


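Encouraged by the success of the ShareGPT4V dataset in the image-text domain, the original authors have now turned to the video domain. Closed-source commercial models have long held a commanding lead in video multimodality. On the one hand, the recent back-to-back launch events from OpenAI and Google pushed AI video reasoning to a new level. On the other hand, OpenAI's Sora did the same for text-to-video generation.

The researchers argue that the large lead of closed-source models in video understanding and video generation likewise depends on detailed, high-quality video-caption data. The team therefore set out once again to obtain large quantities of detailed and accurate captions for videos, in order to improve the video understanding of large video-language models (LVLMs) and the generation ability of text-to-video models.

The paper ranked first in HuggingFace's Daily Papers for June 7, and the repository quickly gained more than 500 stars after the code was released, drawing attention both in China and abroad.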


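The researchers identify three challenges in generating high-quality video descriptions with existing closed-source models:

1. Clearly understanding temporal changes between frames.
2. Describing the content within each frame in detail and accurately.
3. Scaling to videos of arbitrary length.

To address these, they designed a Differential Sliding-Window Captioning (DiffSW) strategy that stably and efficiently generates high-quality descriptions for videos of arbitrary resolution, aspect ratio, and length.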


Figure 1: Differential sliding-window video captioning

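Specifically, each call to GPT4V takes three inputs: the current keyframe, the previous keyframe, and the differential description associated with the previous keyframe. GPT4V is asked to observe the temporal and spatial changes between the two frames and summarize the important changes in the current frame relative to the previous one; this summary is the differential description for the current frame. Finally, all differential descriptions are fed, together with their timestamps, into GPT4, which summarizes them into a high-quality caption for the whole video.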
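The data flow of the differential sliding-window procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: `describe_diff` and `summarize` are hypothetical callables standing in for the GPT4V and GPT4 calls, `Frame` is a placeholder type, and passing `None` for the first keyframe's missing predecessor is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Frame = str  # hypothetical stand-in for real image data
TimedCaption = Tuple[float, str]  # (timestamp in seconds, differential caption)


def diff_sliding_window_caption(
    keyframes: List[Tuple[float, Frame]],
    describe_diff: Callable[[Optional[Frame], Optional[str], Frame], str],
    summarize: Callable[[List[TimedCaption]], str],
) -> str:
    """Caption a video from its keyframes with a two-frame sliding window."""
    prev_frame: Optional[Frame] = None      # assumption: None before the first keyframe
    prev_caption: Optional[str] = None
    timed_diffs: List[TimedCaption] = []
    for timestamp, frame in keyframes:
        # Each step sees the previous keyframe, its differential caption,
        # and the current keyframe.
        caption = describe_diff(prev_frame, prev_caption, frame)
        timed_diffs.append((timestamp, caption))
        prev_frame, prev_caption = frame, caption
    # All differential captions, with timestamps, go to the summarizer.
    return summarize(timed_diffs)


# Toy stand-ins for the two model calls, only to show the data flow.
def toy_describe_diff(prev_frame, prev_caption, frame):
    return f"{prev_frame} -> {frame}"


def toy_summarize(timed_diffs):
    return "; ".join(f"[{t:.0f}s] {c}" for t, c in timed_diffs)


demo_caption = diff_sliding_window_caption(
    [(0.0, "frame_a"), (2.0, "frame_b"), (4.0, "frame_c")],
    toy_describe_diff,
    toy_summarize,
)
print(demo_caption)
```

Because each step depends only on the previous keyframe and its caption, the number of model calls grows linearly with the number of keyframes, which is what allows the approach to handle videos of arbitrary length.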

The team presents several examples:

Caption 1:

The video segment documented a significant event in Kochi, Kerala, where 2 buildings razed in Kochi. The broadcast began with a split-screen presentation: on one side, thick clouds of dust were seen billowing into the sky, marking the onset of the demolition process, while on the other side, reporter Gopikrishnan provided live coverage, indicated by "BREAKING NEWS" captions and a consistent timestamp of "11:10 AM." The news ticker at the bottom of the screen simultaneously ran other global events, maintaining a flow of information. As the video progresses, the split-screen footage of the razed house turns into a close-up. A notable change in the headline to "KOCHI FLATS RAZED" signaled the demolition's culmination. A brief interlude offered a visual contradiction by showcasing the flats presumably before their demolition, providing a stark before and after comparison. As the video progressed, the left building's collapse initiated a dramatic alteration in the skyline, marked by significant dust plumes. Subsequently, another building was shown partially collapsing amid debris, fully obscured by dust in seconds, with surrounding greenery remaining untouched. This transitioned into a graphic interlude featuring the "India Today" logo, briefly pausing the live footage. Resuming to the aftermath, split imagery displayed the rubble and ongoing smoke. Then, the imagery continued to juxtapose the scenes of destruction against intact high-rise buildings nearby. The narrative was augmented by the revelation that the Supreme Court directed the demolition within a broader national news context. Throughout, the report maintained a real-time approach, threading continuity and urgency across the unfolding event's documentation.


Caption 2:

The video begins with an individual seated on a gray couch in a cozy domestic setting, about to unbox a product from a red CCM-branded box placed on a white table in front of them. Initially, the person is seen drinking from a blue can, indicating a casual atmosphere. Soon after, the individual shifts attention from the can to the red box, signifying the start of the unboxing process. The red box, initially closed, gradually becomes the focal point as the person prepares to open it, conveying a build-up of anticipation. As the video progresses, the box is flipped over and then opened, revealing its content still hidden under white tissue paper adorned with prints, adding to the suspense. The individual’s engagement with the box evolves, from initially preparing to open it, to actively delving into its contents. A momentary pause in activity is captured before the anticipation culminates with the individual lifting an object from the box. This object, identifiable by a yellow label, is then examined closely by the person, indicating a thorough inspection or perusal of the product or its packaging. Throughout the video, the surrounding environment remains consistent and undisturbed, with household items like a potted plant and a wall clock maintaining the setting's homely ambiance. The camera’s perspective remains fixed, focusing on the unfolding unboxing event without any movement, thus allowing the viewer to observe the narrative closely. Another partially open brown box is visible beside the main red box, though its role or contents are not elaborated upon. The video encapsulates the anticipation, action, and reveal inherent to unboxing experiences in a home setting.


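Using this method, the researchers built ShareGPT4Video, a large-scale video-caption dataset containing 40K videos (291 hours in total) annotated by GPT-4V. The data covers a wide range of categories, and the generated descriptions include rich world knowledge, object attributes, camera movements, and detailed, precise temporal descriptions of events.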


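Figure 2: (a) The dataset covers a wide range of content, including wildlife, cooking, sports, scenery, egocentric human activities, and autonomous driving scenes. (c) Caption length is mostly between 200 and 400 words, providing rich temporal information that serves video understanding and generation tasks well.

Building on the ShareGPT4Video dataset, and in order to further scale up the data and make it easy for the open-source community to caption their own videos, the researchers developed ShareCaptioner-Video, a versatile large multimodal model that can efficiently generate high-quality descriptions for arbitrary videos.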


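Figure 3: ShareCaptioner-Video is a four-in-one video captioning model with the following capabilities: sliding-window captioning, fast captioning, clip-level caption summarization, and generating detailed descriptions from prompts.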

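Specifically, the sliding-window captioning mode can take over every role that GPT4V played when collecting the annotated data: it produces differential descriptions in a sliding-window fashion and summarizes them into the final caption. The fast captioning mode stitches all keyframes vertically into one tall image and produces the final caption in a single pass, greatly increasing annotation speed at a slight cost in quality. The clip summarization mode, once a full video has been captioned with the sliding window, can directly summarize a caption for any clip within it without rerunning the sliding-window process.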
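Two of the modes described above reduce to simple data preparation steps, sketched below in plain Python. This is an illustration under stated assumptions rather than the model's actual preprocessing: images are represented as nested lists of pixel values, all keyframes are assumed to share the same width, and the function names `stack_frames_vertically` and `captions_for_clip` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values; stand-in for a real image
TimedCaption = Tuple[float, str]  # (timestamp in seconds, differential caption)


def stack_frames_vertically(frames: List[Image]) -> Image:
    """Fast captioning input: concatenate keyframes top to bottom into one tall image."""
    if not frames or not frames[0]:
        raise ValueError("need at least one non-empty keyframe")
    width = len(frames[0][0])
    tall: Image = []
    for frame in frames:
        for row in frame:
            if len(row) != width:
                raise ValueError("all keyframes must share the same width")
            tall.append(list(row))
    return tall


def captions_for_clip(
    cached_diffs: List[TimedCaption], start: float, end: float
) -> List[TimedCaption]:
    """Clip summarization input: reuse cached differential captions inside [start, end]."""
    return [(t, c) for t, c in cached_diffs if start <= t <= end]


# Two tiny "keyframes": a 2x2 image and a 1x2 image.
tall_image = stack_frames_vertically([[[1, 2], [3, 4]], [[5, 6]]])
print(tall_image)  # [[1, 2], [3, 4], [5, 6]]

cached = [(0.0, "a"), (2.0, "b"), (4.0, "c"), (6.0, "d")]
clip_captions = captions_for_clip(cached, 2.0, 4.0)
print(clip_captions)  # [(2.0, 'b'), (4.0, 'c')]
```

The selected captions would then be passed to the summarization step on their own, which is why no second sliding-window pass over the clip is needed.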

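With this captioning model in hand, the researchers used it to annotate a further 4.8 million videos totaling 3,000 hours. These videos have relatively high aesthetic scores and few scene transitions, and are intended for video generation tasks.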


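Table 1: Composition of the 4.8 million videos annotated by ShareCaptioner-Video

Experiments

For video understanding, the researchers first ran a simple equal-amount replacement experiment to verify the effectiveness of the ShareGPT4Video dataset on several current LVLM architectures. Of the 100K video training samples in the VideoChatGPT dataset, they replaced the 28K samples related to detailed captions with an equally sized subset of ShareGPT4Video. As the table below shows, this simple swap, which improves only the quality of the caption data, consistently brings significant gains to video-understanding models of different architectures and scales.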


Table 2: The ShareGPT4Video dataset yields performance gains across model architectures
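The equal-amount replacement setup described above amounts to a controlled data swap: the training set keeps its size, and only the detailed-caption portion changes source. A minimal sketch follows, assuming samples are dictionaries with hypothetical `task` and `source` fields; the function name, field names, and the scaled-down sizes (28 of 100 rather than 28K of 100K) are illustrative only.

```python
import random
from typing import Callable, Dict, List

Sample = Dict[str, str]


def replace_equal_amount(
    base: List[Sample],
    replacement_pool: List[Sample],
    is_detailed_caption: Callable[[Sample], bool],
    seed: int = 0,
) -> List[Sample]:
    """Swap the detailed-caption samples in `base` for an equal number from the pool."""
    kept = [s for s in base if not is_detailed_caption(s)]
    n_swap = len(base) - len(kept)
    if len(replacement_pool) < n_swap:
        raise ValueError("replacement pool is smaller than the subset being replaced")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = kept + rng.sample(replacement_pool, n_swap)
    rng.shuffle(mixed)
    return mixed


# Scaled-down demo: 28 caption samples out of 100 are replaced.
base_data = [{"task": "caption", "source": "old"} for _ in range(28)]
base_data += [{"task": "qa", "source": "old"} for _ in range(72)]
new_pool = [{"task": "caption", "source": "new"} for _ in range(40)]

mixed_data = replace_equal_amount(
    base_data, new_pool, lambda s: s["task"] == "caption"
)
print(len(mixed_data))  # 100
```

Holding the total sample count fixed is what lets the comparison attribute any gain to caption quality rather than to additional training data.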

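Next, the researchers collected 153K video VQA samples themselves and combined them with the 28K high-quality, understanding-related captions from ShareGPT4Video to train a new LVLM, ShareGPT4Video-8B. Training takes only 8 GPUs and 5 hours, yet the model achieves strong results on multiple benchmarks.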

Table 3: Performance comparison on TempCompass

Table 4: Performance comparison on VideoBench

Table 5: Performance comparison on MVBench

Even on several recently released video understanding benchmarks, ShareGPT4Video-8B consistently shows competitive performance at the 7B parameter scale.

Table 6: Performance comparison on LongVideoBench

Table 7: Performance comparison on the Video-MME benchmark

Table 8: Performance comparison on the MMBench-Video benchmark

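For video generation, the researchers used the Open-Sora-Plan project to verify, simply and directly, that detailed captions help text-to-video models. In the figure below, the first row shows results from a text-to-video model trained on short captions, and the second row shows results from a model trained on high-quality captions annotated by ShareCaptioner-Video. Training with detailed captions gives the text-to-video model strong control over camera movement and semantic content.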
