ByteDance's two-step breakthrough: why is its complex document layout parsing so impressive?
1. Limitations of Existing Approaches
Existing document image parsing solutions fall into two broad categories: integration-based methods and end-to-end methods.
- Integration-based methods assemble multiple expert models into a multi-stage pipeline. While they perform well on specific tasks, each model must be optimized independently, and coordinating the components is challenging.
- End-to-end methods use general-purpose or specialized vision-language models (VLMs) to directly generate page-level content autoregressively. They can capture page-level semantics, but on long documents and complex layouts they often suffer from degraded layout structure and efficiency bottlenecks.
Dolphin demo highlights
- Layout recognition and reading-order recognition
- Formula recognition
- Content extraction from a given box region
- Complex tables converted to Markdown with ease
- Borderless tables handled effortlessly
2. The Dolphin Solution
Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) adopts an analyze-then-parse paradigm, decomposing document parsing into two stages:
- Stage 1 performs page-level layout analysis and generates a sequence of layout elements in natural reading order;
- Stage 2 uses these elements as anchors and, combined with task-specific prompts, parses their content in parallel.
This two-stage design avoids the multi-model coordination complexity of traditional integration-based methods, overcomes the efficiency bottlenecks end-to-end methods hit on complex layouts and long documents, and achieves strong runtime efficiency through a lightweight architecture and a parallel parsing mechanism; a minimal sketch of the control flow is given below.
- Layout elements supported by Dolphin
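To make the paradigm concrete, here is a minimal, self-contained sketch of the analyze-then-parse control flow. The `Element` type and the `analyze`/`parse_batch` callables are illustrative stand-ins, not Dolphin's actual API; the real demo code appears later in this article.

```python
from dataclasses import dataclass

@dataclass
class Element:
    label: str   # e.g. "para", "tab", "fig"
    bbox: tuple  # (x1, y1, x2, y2) in page coordinates

def analyze_then_parse(page_image, analyze, parse_batch):
    # Stage 1: one sequential decoding pass yields layout elements
    # in natural reading order.
    elements = analyze(page_image)  # -> list[Element]
    # Stage 2: crop each element and decode all crops in one parallel
    # batch, each guided by a prompt derived from its label.
    crops = [page_image.crop(el.bbox) for el in elements]
    return parse_batch(crops, [el.label for el in elements])
```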
2.1 Page-Level Layout Analysis Stage
- Page image encoding
Dolphin uses a Swin Transformer as its visual encoder, encoding the input page image into a sequence of visual embeddings.
The hierarchical design of the Swin Transformer captures both global layout patterns and local text details.
Before encoding, the input image is resized and padded to a fixed size while preserving its aspect ratio, which avoids text distortion (a sketch of this step follows this list).
- Layout sequence generation
- During layout sequence generation, the decoder is guided by a layout analysis prompt (P_layout) and attends to the encoded visual features through cross-attention.
- The decoder uses the mBart architecture; it identifies document elements and orders them sequentially while preserving structural relationships (e.g., figure-caption pairs, table-caption associations, and the hierarchy between section headings and paragraphs).
- The final layout element sequence L = {l1, l2, ..., ln} records each element's type (e.g., figure, heading, table, paragraph) and bounding box; these elements serve as anchors for the second stage.
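The aspect-preserving resize-and-pad step can be sketched as follows. The 896-pixel target size, white fill, and centered placement are assumptions for illustration, not Dolphin's documented preprocessing constants.

```python
from PIL import Image

def resize_and_pad(image: Image.Image, target: int = 896, fill=(255, 255, 255)) -> Image.Image:
    """Scale the longer side to `target`, then pad to a square canvas,
    keeping the aspect ratio so text is not distorted."""
    w, h = image.size
    scale = target / max(w, h)
    resized = image.resize((max(1, round(w * scale)), max(1, round(h * scale))), Image.BICUBIC)
    canvas = Image.new("RGB", (target, target), fill)
    canvas.paste(resized, ((target - resized.width) // 2, (target - resized.height) // 2))
    return canvas
```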
2.2 Element-Level Content Parsing Stage
- Element image encoding
For each layout element li identified in the first stage,
Dolphin crops the corresponding region from the original image to form a local view Ii.
These local views are encoded in parallel by the same Swin Transformer, producing element-specific visual features.
- Parallel content parsing
- During parallel content parsing, Dolphin uses type-specific prompts to guide the parsing of different elements (see the dispatch sketch after this list).
- For example, tables use a dedicated table prompt (P_table) to be parsed into HTML format,
- while formulas share the same prompt (P_paragraph) as text paragraphs, since they usually appear in paragraph context in inline and display modes, even though their markup format is LaTeX.
- Given the visual features of a local view Ii and its corresponding prompt pi, the decoder generates the parsed content in parallel. This parallel processing strategy, combined with element-specific prompts, ensures computational efficiency while maintaining accurate content recognition.
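A minimal sketch of this type-specific prompt dispatch is shown below. The prompt strings match those used in the demo scripts later in this article; the label names and the dispatch helper itself are illustrative, not Dolphin's official API.

```python
# Map element labels to parsing prompts. Tables get a dedicated prompt
# (HTML output); everything text-like, formulas included, shares the
# paragraph prompt and is emitted with LaTeX markup where needed.
ELEMENT_PROMPTS = {
    "tab": "Parse the table in the image.",
    "para": "Read text in the image.",
}

def prompt_for(label: str) -> str:
    """Fall back to the paragraph prompt for any non-table element."""
    return ELEMENT_PROMPTS.get(label, ELEMENT_PROMPTS["para"])
```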
Training Scheme
- Dataset construction
A large-scale dataset of more than 30 million samples, covering page-level documents and element-level components.
Sources include mixed documents, HTML documents, LaTeX documents, Markdown documents, tables, and formulas.
The data was processed and annotated in different ways to meet the model's varied training needs.
- Training procedure
Dolphin uses a dynamic task selection strategy: for each training sample, an applicable task is chosen at random from the annotations that sample actually carries, and a question-answer pair is built from it (see the sketch after this list).
This strategy improves the model's generalization, enabling it to handle many types of document parsing tasks.
Training also initializes from pretrained weights: instruction tuning on top of the Donut model extends the model's ability to understand and execute diverse prompts.
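As a toy illustration of dynamic task selection, the sketch below draws a task uniformly from whichever annotations a sample has. The annotation keys and prompt strings are assumptions for illustration, not ByteDance's actual training code.

```python
import random

# (task name, annotation key required, prompt) -- illustrative values only.
TASKS = [
    ("layout", "layout", "Parse the reading order of this document."),
    ("table", "html", "Parse the table in the image."),
    ("text", "text", "Read text in the image."),
]

def make_qa_pair(sample: dict):
    """Randomly pick a task this sample has annotations for and
    build a (prompt, answer) training pair from it."""
    available = [(name, key, prompt) for name, key, prompt in TASKS if key in sample]
    name, key, prompt = random.choice(available)
    return prompt, sample[key]
```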
Dolphin in Practice
You can try it for free via this link:
http://115.190.42.15:8888/dolphin/
Dolphin provides two inference frameworks, supporting two parsing granularities:
- Page-level parsing: parses an entire document image into structured JSON and Markdown (an example record is shown below)
- Element-level parsing: parses individual document elements (text, tables, formulas)
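For reference, each parsed element in the demo code below is recorded as a dict with the following fields. The field names come from that code; the values here are made up for illustration.

```python
# One entry of recognition_results as built by the demo scripts below.
example_element = {
    "label": "tab",                # element type, e.g. "tab", "fig", or a text label
    "bbox": [72, 130, 540, 310],   # [x1, y1, x2, y2] in original image coordinates
    "text": "<table>...</table>",  # parsed content (HTML for tables)
    "reading_order": 3,            # position in the page's reading order
}
```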
- Page-level parsing
```python
import argparse
import glob
import os

import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model

        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = VisionEncoderDecoderModel.from_pretrained(model_id_or_path)
        self.model.eval()

        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model = self.model.half()  # Always use half precision by default

        # Set tokenizer
        self.tokenizer = self.processor.tokenizer

    def chat(self, prompt, image):
        """Process an image or batch of images with the given prompt(s)

        Args:
            prompt: Text prompt or list of prompts to guide the model
            image: PIL Image or list of PIL Images to process

        Returns:
            Generated text or list of texts from the model
        """
        # Check if we're dealing with a batch
        is_batch = isinstance(image, list)

        if not is_batch:
            # Single image, wrap it in a list for consistent processing
            images = [image]
            prompts = [prompt]
        else:
            # Batch of images
            images = image
            prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)

        # Prepare images
        batch_inputs = self.processor(images, return_tensors="pt", padding=True)
        batch_pixel_values = batch_inputs.pixel_values.half().to(self.device)

        # Prepare prompts
        prompts = [f"<s>{p} <Answer/>" for p in prompts]
        batch_prompt_inputs = self.tokenizer(
            prompts,
            add_special_tokens=False,
            return_tensors="pt"
        )
        batch_prompt_ids = batch_prompt_inputs.input_ids.to(self.device)
        batch_attention_mask = batch_prompt_inputs.attention_mask.to(self.device)

        # Generate text
        outputs = self.model.generate(
            pixel_values=batch_pixel_values,
            decoder_input_ids=batch_prompt_ids,
            decoder_attention_mask=batch_attention_mask,
            min_length=1,
            max_length=4096,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[self.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
            do_sample=False,
            num_beams=1,
            repetition_penalty=1.1
        )

        # Process output
        sequences = self.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)

        # Clean prompt text from output
        results = []
        for i, sequence in enumerate(sequences):
            cleaned = sequence.replace(prompts[i], "").replace("<pad>", "").replace("</s>", "").strip()
            results.append(cleaned)

        # Return a single result for single image input
        if not is_batch:
            return results[0]
        return results


def process_page(image_path, model, save_dir, max_batch_size=None):
    """Parse document images with two stages"""
    # Stage 1: Page-level layout and reading order parsing
    pil_image = Image.open(image_path).convert("RGB")
    layout_output = model.chat("Parse the reading order of this document.", pil_image)

    # Stage 2: Element-level content parsing
    padded_image, dims = prepare_image(pil_image)
    recognition_results = process_elements(layout_output, padded_image, dims, model, max_batch_size)

    # Save outputs
    json_path = save_outputs(recognition_results, image_path, save_dir)
    return json_path, recognition_results


def process_elements(layout_results, padded_image, dims, model, max_batch_size=None):
    """Parse all document elements with parallel decoding"""
    layout_results = parse_layout_string(layout_results)

    # Store text and table elements separately
    text_elements = []    # Text elements
    table_elements = []   # Table elements
    figure_results = []   # Image elements (no processing needed)
    previous_box = None
    reading_order = 0

    # Collect elements to process and group by type
    for bbox, label in layout_results:
        try:
            # Adjust coordinates
            x1, y1, x2, y2, orig_x1, orig_y1, orig_x2, orig_y2, previous_box = process_coordinates(
                bbox, padded_image, dims, previous_box
            )

            # Crop and parse element
            cropped = padded_image[y1:y2, x1:x2]
            if cropped.size > 0:
                if label == "fig":
                    # For figure regions, add empty text result immediately
                    figure_results.append(
                        {
                            "label": label,
                            "bbox": [orig_x1, orig_y1, orig_x2, orig_y2],
                            "text": "",
                            "reading_order": reading_order,
                        }
                    )
                else:
                    # Prepare element for parsing
                    pil_crop = Image.fromarray(cv2.cvtColor(cropped, cv2.COLOR_BGR2RGB))
                    element_info = {
                        "crop": pil_crop,
                        "label": label,
                        "bbox": [orig_x1, orig_y1, orig_x2, orig_y2],
                        "reading_order": reading_order,
                    }

                    # Group by type
                    if label == "tab":
                        table_elements.append(element_info)
                    else:  # Text elements
                        text_elements.append(element_info)

            reading_order += 1
        except Exception as e:
            print(f"Error processing bbox with label {label}: {str(e)}")
            continue

    # Initialize results list
    recognition_results = figure_results.copy()

    # Process text elements (in batches)
    if text_elements:
        text_results = process_element_batch(text_elements, model, "Read text in the image.", max_batch_size)
        recognition_results.extend(text_results)

    # Process table elements (in batches)
    if table_elements:
        table_results = process_element_batch(table_elements, model, "Parse the table in the image.", max_batch_size)
        recognition_results.extend(table_results)

    # Sort elements by reading order
    recognition_results.sort(key=lambda x: x.get("reading_order", 0))

    return recognition_results


def process_element_batch(elements, model, prompt, max_batch_size=None):
    """Process elements of the same type in batches"""
    results = []

    # Determine batch size
    batch_size = len(elements)
    if max_batch_size is not None and max_batch_size > 0:
        batch_size = min(batch_size, max_batch_size)

    # Process in batches
    for i in range(0, len(elements), batch_size):
        batch_elements = elements[i:i + batch_size]
        crops_list = [elem["crop"] for elem in batch_elements]

        # Use the same prompt for all elements in the batch
        prompts_list = [prompt] * len(crops_list)

        # Batch inference
        batch_results = model.chat(prompts_list, crops_list)

        # Add results
        for j, result in enumerate(batch_results):
            elem = batch_elements[j]
            results.append({
                "label": elem["label"],
                "bbox": elem["bbox"],
                "text": result.strip(),
                "reading_order": elem["reading_order"],
            })

    return results


def main():
    parser = argparse.ArgumentParser(description="Document processing tool using DOLPHIN model")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument("--input_path", type=str, default="./demo", help="Path to input image or directory of images")
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save parsing results (default: same as input directory)",
    )
    parser.add_argument(
        "--max_batch_size",
        type=int,
        default=16,
        help="Maximum number of document elements to parse in a single batch (default: 16)",
    )
    args = parser.parse_args()

    # Load model
    model = DOLPHIN(args.model_path)

    # Collect document images
    if os.path.isdir(args.input_path):
        image_files = []
        for ext in [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG"]:
            image_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
        image_files = sorted(image_files)
    else:
        if not os.path.exists(args.input_path):
            raise FileNotFoundError(f"Input path {args.input_path} does not exist")
        image_files = [args.input_path]

    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
    )
    setup_output_dirs(save_dir)

    total_samples = len(image_files)
    print(f"\nTotal samples to process: {total_samples}")

    # Process all document images
    for image_path in image_files:
        print(f"\nProcessing {image_path}")
        try:
            json_path, recognition_results = process_page(
                image_path=image_path,
                model=model,
                save_dir=save_dir,
                max_batch_size=args.max_batch_size,
            )
            print(f"Processing completed. Results saved to {save_dir}")
        except Exception as e:
            print(f"Error processing {image_path}: {str(e)}")
            continue


if __name__ == "__main__":
    main()
```
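Assuming the page-level script above is saved as `demo_page.py` (the filename is arbitrary) and the `utils` helpers from the Dolphin repository are on the path, it can be run as, e.g., `python demo_page.py --model_path ./hf_model --input_path ./demo --max_batch_size 16`.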
- Element-level parsing
```python
import argparse
import glob
import os

import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model

        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = VisionEncoderDecoderModel.from_pretrained(model_id_or_path)
        self.model.eval()

        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model = self.model.half()  # Always use half precision by default

        # Set tokenizer
        self.tokenizer = self.processor.tokenizer

    def chat(self, prompt, image):
        """Process an image with the given prompt

        Args:
            prompt: Text prompt to guide the model
            image: PIL Image to process

        Returns:
            Generated text from the model
        """
        # Prepare image
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        pixel_values = pixel_values.half()

        # Prepare prompt
        prompt = f"<s>{prompt} <Answer/>"
        prompt_ids = self.tokenizer(
            prompt,
            add_special_tokens=False,
            return_tensors="pt"
        ).input_ids.to(self.device)
        decoder_attention_mask = torch.ones_like(prompt_ids)

        # Generate text
        outputs = self.model.generate(
            pixel_values=pixel_values.to(self.device),
            decoder_input_ids=prompt_ids,
            decoder_attention_mask=decoder_attention_mask,
            min_length=1,
            max_length=4096,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[self.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
            do_sample=False,
            num_beams=1,
        )

        # Process the output
        sequence = self.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
        sequence = sequence.replace(prompt, "").replace("<pad>", "").replace("</s>", "").strip()
        return sequence


def process_element(image_path, model, element_type, save_dir=None):
    """Process a single element image (text, table, formula)

    Args:
        image_path: Path to the element image
        model: DOLPHIN model instance
        element_type: Type of element ('text', 'table', 'formula')
        save_dir: Directory to save results (default: same as input directory)

    Returns:
        Parsed content of the element and recognition results
    """
    # Load and prepare image
    pil_image = Image.open(image_path).convert("RGB")
    pil_image = crop_margin(pil_image)

    # Select appropriate prompt based on element type
    if element_type == "table":
        prompt = "Parse the table in the image."
        label = "tab"
    elif element_type == "formula":
        prompt = "Read text in the image."
        label = "formula"
    else:  # Default to text
        prompt = "Read text in the image."
        label = "text"

    # Process the element
    result = model.chat(prompt, pil_image)

    # Create recognition result in the same format as the document parser
    recognition_result = [
        {
            "label": label,
            "text": result.strip(),
        }
    ]

    # Save results if save_dir is provided
    if save_dir:
        save_outputs(recognition_result, image_path, save_dir)
        print(f"Results saved to {save_dir}")

    return result, recognition_result


def main():
    parser = argparse.ArgumentParser(description="Element-level processing using DOLPHIN model")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument("--input_path", type=str, required=True, help="Path to input image or directory of images")
    parser.add_argument(
        "--element_type",
        type=str,
        choices=["text", "table", "formula"],
        default="text",
        help="Type of element to process (text, table, formula)",
    )
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save parsing results (default: same as input directory)",
    )
    parser.add_argument("--print_results", action="store_true", help="Print recognition results to console")
    args = parser.parse_args()

    # Load model
    model = DOLPHIN(args.model_path)

    # Set save directory
    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
    )
    setup_output_dirs(save_dir)

    # Collect images
    if os.path.isdir(args.input_path):
        image_files = []
        for ext in [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG"]:
            image_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
        image_files = sorted(image_files)
    else:
        if not os.path.exists(args.input_path):
            raise FileNotFoundError(f"Input path {args.input_path} does not exist")
        image_files = [args.input_path]

    total_samples = len(image_files)
    print(f"\nTotal samples to process: {total_samples}")

    # Process images one by one
    for image_path in image_files:
        print(f"\nProcessing {image_path}")
        try:
            result, recognition_result = process_element(
                image_path=image_path,
                model=model,
                element_type=args.element_type,
                save_dir=save_dir,
            )
            if args.print_results:
                print("\nRecognition result:")
                print(result)
                print("-" * 40)
        except Exception as e:
            print(f"Error processing {image_path}: {str(e)}")
            continue


if __name__ == "__main__":
    main()
```
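Likewise, saving the element-level script as `demo_element.py` (filename arbitrary), a single table image can be parsed with `python demo_element.py --input_path ./table.png --element_type table --print_results`.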
https://github.com/bytedance/Dolphin
This article is reproduced from CourseAI; author: CourseAI.
