ByteDance's two-step breakthrough: why is its complex document layout parsing so impressive?
1. Limitations of Existing Approaches
Existing document image parsing solutions fall into two broad categories: integration-based methods and end-to-end methods.
- Integration-based methods assemble multiple expert models into a multi-stage pipeline. While they perform well on specific tasks, each model must be optimized independently, and coordinating the components is challenging.
- End-to-end methods use general-purpose or specialized vision-language models (VLMs) to directly generate page-level content autoregressively. They can capture page-level semantics, but on long documents and complex layouts they often suffer from degraded layout structure and efficiency bottlenecks.
Dolphin demo highlights
- Layout recognition and reading-order recognition
- Formula recognition
- Content extraction from a given box region
- Complex tables converted to Markdown with ease
- Borderless tables handled effortlessly
2. The Dolphin Solution
Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) adopts an analyze-then-parse paradigm, decomposing document parsing into two stages:
- Stage 1 performs page-level layout analysis and generates a sequence of layout elements in natural reading order;
- Stage 2 uses these elements as anchors and, combined with task-specific prompts, parses their content in parallel.
This two-stage design avoids the multi-model coordination complexity of traditional integration-based methods, overcomes the efficiency bottlenecks end-to-end methods hit on complex layouts and long documents, and achieves strong runtime efficiency through a lightweight architecture and a parallel parsing mechanism; a minimal sketch of the control flow is given below.
- Layout elements supported by Dolphin
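To make the paradigm concrete, here is a minimal, self-contained sketch of the analyze-then-parse control flow. The `Element` type and the `analyze`/`parse_batch` callables are illustrative stand-ins, not Dolphin's actual API; the real demo code appears later in this article.

```python
from dataclasses import dataclass

@dataclass
class Element:
    label: str   # e.g. "para", "tab", "fig"
    bbox: tuple  # (x1, y1, x2, y2) in page coordinates

def analyze_then_parse(page_image, analyze, parse_batch):
    # Stage 1: one sequential decoding pass yields layout elements
    # in natural reading order.
    elements = analyze(page_image)  # -> list[Element]
    # Stage 2: crop each element and decode all crops in one parallel
    # batch, each guided by a prompt derived from its label.
    crops = [page_image.crop(el.bbox) for el in elements]
    return parse_batch(crops, [el.label for el in elements])
```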
2.1 Page-Level Layout Analysis Stage
- Page image encoding
Dolphin uses a Swin Transformer as its visual encoder, encoding the input page image into a sequence of visual embeddings.
The hierarchical design of the Swin Transformer captures both global layout patterns and local text details.
Before encoding, the input image is resized and padded to a fixed size while preserving its aspect ratio, which avoids text distortion (a sketch of this step follows this list).
- Layout sequence generation
- During layout sequence generation, the decoder is guided by a layout analysis prompt (P_layout) and attends to the encoded visual features through cross-attention.
- The decoder uses the mBart architecture; it identifies document elements and orders them sequentially while preserving structural relationships (e.g., figure-caption pairs, table-caption associations, and the hierarchy between section headings and paragraphs).
- The final layout element sequence L = {l1, l2, ..., ln} records each element's type (e.g., figure, heading, table, paragraph) and bounding box; these elements serve as anchors for the second stage.
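The aspect-preserving resize-and-pad step can be sketched as follows. The 896-pixel target size, white fill, and centered placement are assumptions for illustration, not Dolphin's documented preprocessing constants.

```python
from PIL import Image

def resize_and_pad(image: Image.Image, target: int = 896, fill=(255, 255, 255)) -> Image.Image:
    """Scale the longer side to `target`, then pad to a square canvas,
    keeping the aspect ratio so text is not distorted."""
    w, h = image.size
    scale = target / max(w, h)
    resized = image.resize((max(1, round(w * scale)), max(1, round(h * scale))), Image.BICUBIC)
    canvas = Image.new("RGB", (target, target), fill)
    canvas.paste(resized, ((target - resized.width) // 2, (target - resized.height) // 2))
    return canvas
```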
2.2 Element-Level Content Parsing Stage
- Element image encoding
For each layout element li identified in the first stage,
Dolphin crops the corresponding region from the original image to form a local view Ii.
These local views are encoded in parallel by the same Swin Transformer, producing element-specific visual features.
- Parallel content parsing
- During parallel content parsing, Dolphin uses type-specific prompts to guide the parsing of different elements (see the dispatch sketch after this list).
- For example, tables use a dedicated table prompt (P_table) to be parsed into HTML format,
- while formulas share the same prompt (P_paragraph) as text paragraphs, since they usually appear in paragraph context in inline and display modes, even though their markup format is LaTeX.
- Given the visual features of a local view Ii and its corresponding prompt pi, the decoder generates the parsed content in parallel. This parallel processing strategy, combined with element-specific prompts, ensures computational efficiency while maintaining accurate content recognition.
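A minimal sketch of this type-specific prompt dispatch is shown below. The prompt strings match those used in the demo scripts later in this article; the label names and the dispatch helper itself are illustrative, not Dolphin's official API.

```python
# Map element labels to parsing prompts. Tables get a dedicated prompt
# (HTML output); everything text-like, formulas included, shares the
# paragraph prompt and is emitted with LaTeX markup where needed.
ELEMENT_PROMPTS = {
    "tab": "Parse the table in the image.",
    "para": "Read text in the image.",
}

def prompt_for(label: str) -> str:
    """Fall back to the paragraph prompt for any non-table element."""
    return ELEMENT_PROMPTS.get(label, ELEMENT_PROMPTS["para"])
```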
Training Scheme
- Dataset construction
A large-scale dataset of more than 30 million samples, covering page-level documents and element-level components.
Sources include mixed documents, HTML documents, LaTeX documents, Markdown documents, tables, and formulas.
The data was processed and annotated in different ways to meet the model's varied training needs.
- Training procedure
Dolphin uses a dynamic task selection strategy: for each training sample, an applicable task is chosen at random from the annotations that sample actually carries, and a question-answer pair is built from it (see the sketch after this list).
This strategy improves the model's generalization, enabling it to handle many types of document parsing tasks.
Training also initializes from pretrained weights: instruction tuning on top of the Donut model extends the model's ability to understand and execute diverse prompts.
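As a toy illustration of dynamic task selection, the sketch below draws a task uniformly from whichever annotations a sample has. The annotation keys and prompt strings are assumptions for illustration, not ByteDance's actual training code.

```python
import random

# (task name, annotation key required, prompt) -- illustrative values only.
TASKS = [
    ("layout", "layout", "Parse the reading order of this document."),
    ("table", "html", "Parse the table in the image."),
    ("text", "text", "Read text in the image."),
]

def make_qa_pair(sample: dict):
    """Randomly pick a task this sample has annotations for and
    build a (prompt, answer) training pair from it."""
    available = [(name, key, prompt) for name, key, prompt in TASKS if key in sample]
    name, key, prompt = random.choice(available)
    return prompt, sample[key]
```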
Dolphin in Practice
You can try it for free via this link:
http://115.190.42.15:8888/dolphin/
Dolphin provides two inference frameworks, supporting two parsing granularities:
- Page-level parsing: parses an entire document image into structured JSON and Markdown (an example record is shown below)
- Element-level parsing: parses individual document elements (text, tables, formulas)
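For reference, each parsed element in the demo code below is recorded as a dict with the following fields. The field names come from that code; the values here are made up for illustration.

```python
# One entry of recognition_results as built by the demo scripts below.
example_element = {
    "label": "tab",                # element type, e.g. "tab", "fig", or a text label
    "bbox": [72, 130, 540, 310],   # [x1, y1, x2, y2] in original image coordinates
    "text": "<table>...</table>",  # parsed content (HTML for tables)
    "reading_order": 3,            # position in the page's reading order
}
```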
- Page-level parsing
```python
import argparse
import glob
import os

import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model

        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = VisionEncoderDecoderModel.from_pretrained(model_id_or_path)
        self.model.eval()

        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model = self.model.half()  # Always use half precision by default

        # Set tokenizer
        self.tokenizer = self.processor.tokenizer

    def chat(self, prompt, image):
        """Process an image or batch of images with the given prompt(s)

        Args:
            prompt: Text prompt or list of prompts to guide the model
            image: PIL Image or list of PIL Images to process

        Returns:
            Generated text or list of texts from the model
        """
        # Check if we're dealing with a batch
        is_batch = isinstance(image, list)

        if not is_batch:
            # Single image, wrap it in a list for consistent processing
            images = [image]
            prompts = [prompt]
        else:
            # Batch of images
            images = image
            prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)

        # Prepare images
        batch_inputs = self.processor(images, return_tensors="pt", padding=True)
        batch_pixel_values = batch_inputs.pixel_values.half().to(self.device)

        # Prepare prompts
        prompts = [f"<s>{p} <Answer/>" for p in prompts]
        batch_prompt_inputs = self.tokenizer(
            prompts,
            add_special_tokens=False,
            return_tensors="pt"
        )
        batch_prompt_ids = batch_prompt_inputs.input_ids.to(self.device)
        batch_attention_mask = batch_prompt_inputs.attention_mask.to(self.device)

        # Generate text
        outputs = self.model.generate(
            pixel_values=batch_pixel_values,
            decoder_input_ids=batch_prompt_ids,
            decoder_attention_mask=batch_attention_mask,
            min_length=1,
            max_length=4096,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[self.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
            do_sample=False,
            num_beams=1,
            repetition_penalty=1.1
        )

        # Process output
        sequences = self.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)

        # Clean prompt text from output
        results = []
        for i, sequence in enumerate(sequences):
            cleaned = sequence.replace(prompts[i], "").replace("<pad>", "").replace("</s>", "").strip()
            results.append(cleaned)

        # Return a single result for single image input
        if not is_batch:
            return results[0]
        return results


def process_page(image_path, model, save_dir, max_batch_size=None):
    """Parse document images with two stages"""
    # Stage 1: Page-level layout and reading order parsing
    pil_image = Image.open(image_path).convert("RGB")
    layout_output = model.chat("Parse the reading order of this document.", pil_image)

    # Stage 2: Element-level content parsing
    padded_image, dims = prepare_image(pil_image)
    recognition_results = process_elements(layout_output, padded_image, dims, model, max_batch_size)

    # Save outputs
    json_path = save_outputs(recognition_results, image_path, save_dir)
    return json_path, recognition_results


def process_elements(layout_results, padded_image, dims, model, max_batch_size=None):
    """Parse all document elements with parallel decoding"""
    layout_results = parse_layout_string(layout_results)

    # Store text and table elements separately
    text_elements = []    # Text elements
    table_elements = []   # Table elements
    figure_results = []   # Image elements (no processing needed)
    previous_box = None
    reading_order = 0

    # Collect elements to process and group by type
    for bbox, label in layout_results:
        try:
            # Adjust coordinates
            x1, y1, x2, y2, orig_x1, orig_y1, orig_x2, orig_y2, previous_box = process_coordinates(
                bbox, padded_image, dims, previous_box
            )

            # Crop and parse element
            cropped = padded_image[y1:y2, x1:x2]
            if cropped.size > 0:
                if label == "fig":
                    # For figure regions, add empty text result immediately
                    figure_results.append(
                        {
                            "label": label,
                            "bbox": [orig_x1, orig_y1, orig_x2, orig_y2],
                            "text": "",
                            "reading_order": reading_order,
                        }
                    )
                else:
                    # Prepare element for parsing
                    pil_crop = Image.fromarray(cv2.cvtColor(cropped, cv2.COLOR_BGR2RGB))
                    element_info = {
                        "crop": pil_crop,
                        "label": label,
                        "bbox": [orig_x1, orig_y1, orig_x2, orig_y2],
                        "reading_order": reading_order,
                    }

                    # Group by type
                    if label == "tab":
                        table_elements.append(element_info)
                    else:  # Text elements
                        text_elements.append(element_info)

            reading_order += 1
        except Exception as e:
            print(f"Error processing bbox with label {label}: {str(e)}")
            continue

    # Initialize results list
    recognition_results = figure_results.copy()

    # Process text elements (in batches)
    if text_elements:
        text_results = process_element_batch(text_elements, model, "Read text in the image.", max_batch_size)
        recognition_results.extend(text_results)

    # Process table elements (in batches)
    if table_elements:
        table_results = process_element_batch(table_elements, model, "Parse the table in the image.", max_batch_size)
        recognition_results.extend(table_results)

    # Sort elements by reading order
    recognition_results.sort(key=lambda x: x.get("reading_order", 0))

    return recognition_results


def process_element_batch(elements, model, prompt, max_batch_size=None):
    """Process elements of the same type in batches"""
    results = []

    # Determine batch size
    batch_size = len(elements)
    if max_batch_size is not None and max_batch_size > 0:
        batch_size = min(batch_size, max_batch_size)

    # Process in batches
    for i in range(0, len(elements), batch_size):
        batch_elements = elements[i:i + batch_size]
        crops_list = [elem["crop"] for elem in batch_elements]

        # Use the same prompt for all elements in the batch
        prompts_list = [prompt] * len(crops_list)

        # Batch inference
        batch_results = model.chat(prompts_list, crops_list)

        # Add results
        for j, result in enumerate(batch_results):
            elem = batch_elements[j]
            results.append({
                "label": elem["label"],
                "bbox": elem["bbox"],
                "text": result.strip(),
                "reading_order": elem["reading_order"],
            })

    return results


def main():
    parser = argparse.ArgumentParser(description="Document processing tool using DOLPHIN model")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument("--input_path", type=str, default="./demo", help="Path to input image or directory of images")
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save parsing results (default: same as input directory)",
    )
    parser.add_argument(
        "--max_batch_size",
        type=int,
        default=16,
        help="Maximum number of document elements to parse in a single batch (default: 16)",
    )
    args = parser.parse_args()

    # Load model
    model = DOLPHIN(args.model_path)

    # Collect document images
    if os.path.isdir(args.input_path):
        image_files = []
        for ext in [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG"]:
            image_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
        image_files = sorted(image_files)
    else:
        if not os.path.exists(args.input_path):
            raise FileNotFoundError(f"Input path {args.input_path} does not exist")
        image_files = [args.input_path]

    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
    )
    setup_output_dirs(save_dir)

    total_samples = len(image_files)
    print(f"\nTotal samples to process: {total_samples}")

    # Process all document images
    for image_path in image_files:
        print(f"\nProcessing {image_path}")
        try:
            json_path, recognition_results = process_page(
                image_path=image_path,
                model=model,
                save_dir=save_dir,
                max_batch_size=args.max_batch_size,
            )
            print(f"Processing completed. Results saved to {save_dir}")
        except Exception as e:
            print(f"Error processing {image_path}: {str(e)}")
            continue


if __name__ == "__main__":
    main()
```
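Assuming the page-level script above is saved as `demo_page.py` (the filename is arbitrary) and the `utils` helpers from the Dolphin repository are on the path, it can be run as, e.g., `python demo_page.py --model_path ./hf_model --input_path ./demo --max_batch_size 16`.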
- Element-level parsing
```python
import argparse
import glob
import os

import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model

        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = VisionEncoderDecoderModel.from_pretrained(model_id_or_path)
        self.model.eval()

        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model = self.model.half()  # Always use half precision by default

        # Set tokenizer
        self.tokenizer = self.processor.tokenizer

    def chat(self, prompt, image):
        """Process an image with the given prompt

        Args:
            prompt: Text prompt to guide the model
            image: PIL Image to process

        Returns:
            Generated text from the model
        """
        # Prepare image
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        pixel_values = pixel_values.half()

        # Prepare prompt
        prompt = f"<s>{prompt} <Answer/>"
        prompt_ids = self.tokenizer(
            prompt,
            add_special_tokens=False,
            return_tensors="pt"
        ).input_ids.to(self.device)
        decoder_attention_mask = torch.ones_like(prompt_ids)

        # Generate text
        outputs = self.model.generate(
            pixel_values=pixel_values.to(self.device),
            decoder_input_ids=prompt_ids,
            decoder_attention_mask=decoder_attention_mask,
            min_length=1,
            max_length=4096,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[self.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
            do_sample=False,
            num_beams=1,
        )

        # Process the output
        sequence = self.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
        sequence = sequence.replace(prompt, "").replace("<pad>", "").replace("</s>", "").strip()
        return sequence


def process_element(image_path, model, element_type, save_dir=None):
    """Process a single element image (text, table, formula)

    Args:
        image_path: Path to the element image
        model: DOLPHIN model instance
        element_type: Type of element ('text', 'table', 'formula')
        save_dir: Directory to save results (default: same as input directory)

    Returns:
        Parsed content of the element and recognition results
    """
    # Load and prepare image
    pil_image = Image.open(image_path).convert("RGB")
    pil_image = crop_margin(pil_image)

    # Select appropriate prompt based on element type
    if element_type == "table":
        prompt = "Parse the table in the image."
        label = "tab"
    elif element_type == "formula":
        prompt = "Read text in the image."
        label = "formula"
    else:  # Default to text
        prompt = "Read text in the image."
        label = "text"

    # Process the element
    result = model.chat(prompt, pil_image)

    # Create recognition result in the same format as the document parser
    recognition_result = [
        {
            "label": label,
            "text": result.strip(),
        }
    ]

    # Save results if save_dir is provided
    if save_dir:
        save_outputs(recognition_result, image_path, save_dir)
        print(f"Results saved to {save_dir}")

    return result, recognition_result


def main():
    parser = argparse.ArgumentParser(description="Element-level processing using DOLPHIN model")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument("--input_path", type=str, required=True, help="Path to input image or directory of images")
    parser.add_argument(
        "--element_type",
        type=str,
        choices=["text", "table", "formula"],
        default="text",
        help="Type of element to process (text, table, formula)",
    )
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save parsing results (default: same as input directory)",
    )
    parser.add_argument("--print_results", action="store_true", help="Print recognition results to console")
    args = parser.parse_args()

    # Load model
    model = DOLPHIN(args.model_path)

    # Set save directory
    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
    )
    setup_output_dirs(save_dir)

    # Collect images
    if os.path.isdir(args.input_path):
        image_files = []
        for ext in [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG"]:
            image_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
        image_files = sorted(image_files)
    else:
        if not os.path.exists(args.input_path):
            raise FileNotFoundError(f"Input path {args.input_path} does not exist")
        image_files = [args.input_path]

    total_samples = len(image_files)
    print(f"\nTotal samples to process: {total_samples}")

    # Process images one by one
    for image_path in image_files:
        print(f"\nProcessing {image_path}")
        try:
            result, recognition_result = process_element(
                image_path=image_path,
                model=model,
                element_type=args.element_type,
                save_dir=save_dir,
            )
            if args.print_results:
                print("\nRecognition result:")
                print(result)
                print("-" * 40)
        except Exception as e:
            print(f"Error processing {image_path}: {str(e)}")
            continue


if __name__ == "__main__":
    main()
```
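Likewise, saving the element-level script as `demo_element.py` (filename arbitrary), a single table image can be parsed with `python demo_element.py --input_path ./table.png --element_type table --print_results`.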
https://github.com/bytedance/Dolphin
This article is reproduced from CourseAI; author: CourseAI.
