One-Click Generation of Fully Dubbed Videos: UVR5 + Synthesis Pipeline Design (Integrating SadTalker + AnimateDiff)

Author: 編程疏影 2025-05-14 07:35:27

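By connecting UVR5, Whisper, XTTSv2, Aeneas, SadTalker/AnimateDiff, and FFmpeg end to end, we can build EasyDub, a highly automated system for generating dubbed videos. In the future we also plan to integrate more capabilities (such as ReRender, Multi-Speaker Management, and a character voice library) to extend what EasyDub can do.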

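EasyDub is an open-source pipeline for automatically generating dubbed videos. It covers the full chain: vocal separation, speech recognition, translation, speech synthesis, subtitle synchronization, digital-human generation, and final video packaging. Combining Java (Spring Boot) with a Python toolchain, this article shows in detail how to implement it with UVR5, Whisper, XTTSv2, Aeneas, FFmpeg, SadTalker, and AnimateDiff.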

Separating Vocals and Accompaniment with UVR5

UVR5 (Ultimate Vocal Remover v5) can accurately extract the vocals and the accompaniment from a music track:

python inference_main.py --model demucs_uvr --input input.mp4 --output outputs/

Output:

outputs/input_Instrumental.wav    accompaniment
outputs/input_Vocals.wav          vocals
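Before the later stages start, it can help to confirm that both stems were actually written. The following is a minimal sketch that assumes the naming convention in the listing above (`<input stem>_Vocals.wav` and `<input stem>_Instrumental.wav`); the helper names `expected_stems` and `missing_stems` are hypothetical and not part of UVR5.

```python
from pathlib import Path


def expected_stems(input_path: str, output_dir: str = "outputs") -> dict:
    """Return the stem paths the separation step is expected to write.

    Assumption: files are named <input stem>_Vocals.wav and
    <input stem>_Instrumental.wav, as in the listing above.
    """
    stem = Path(input_path).stem
    out = Path(output_dir)
    return {
        "vocals": out / f"{stem}_Vocals.wav",
        "instrumental": out / f"{stem}_Instrumental.wav",
    }


def missing_stems(input_path: str, output_dir: str = "outputs") -> list:
    """List the expected stems that do not exist on disk yet."""
    stems = expected_stems(input_path, output_dir)
    return [name for name, path in stems.items() if not path.is_file()]
```

An empty list from `missing_stems("input.mp4")` means both files are in place and the transcription step can read `outputs/input_Vocals.wav`.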

Speech Recognition + Translation + Synthesis: The Multilingual Chain

Transcribing the Vocals with Whisper

from transformers import pipeline, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

model_id = "openai/whisper-large-v3"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)
# The ASR pipeline needs both the tokenizer and the feature extractor from the processor.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # split long recordings into 30-second windows
    device=device,
)

result = pipe("outputs/input_Vocals.wav")
original_text = result["text"]
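If subtitles for the original speech are also wanted, the Transformers ASR pipeline can be called with `return_timestamps=True`, in which case the result carries a `chunks` list whose items have a `text` and a `(start, end)` `timestamp`. The sketch below turns such a list into SRT text. The helper names are hypothetical, and the handling of a missing end time (reusing the start time) is an assumption made for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Build SRT text from items shaped like {"text": str, "timestamp": (start, end)}."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: if the final chunk has no end time, reuse its start time.
            end = start
        text = chunk["text"].strip()
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

With the pipeline above this would be used as `chunks_to_srt(pipe("outputs/input_Vocals.wav", return_timestamps=True)["chunks"])`.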

Translating the Text with Qwen (e.g. English → Chinese)

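from transformers import AutoTokenizer, AutoModelForCausalLM

# Hugging Face hosts this model as "Qwen/Qwen1.5-4B" (no hyphen after "Qwen").
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-4B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-4B")

# The prompt reads "Translate into Chinese: <text>".
inputs = tokenizer(f"翻譯為中文: {original_text}", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
# generate() returns the prompt followed by the continuation; decode only the new tokens.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
translated_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()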

Cloning the Voice and Synthesizing Speech with XTTSv2

python tts_api.py --text "translated_text" --voice clone_sample.pth --output tts.wav

Subtitle Synchronization and Lip-Sync Driving: Aeneas + SadTalker

Generating SRT Subtitles Automatically with Aeneas

from aeneas.task import Task
from aeneas.executetask import ExecuteTask

# task_language must match the language spoken in tts.wav.
task = Task(config_string="task_language=eng|is_text_type=plain|os_task_file_format=srt")
# translated.txt holds the translated text, one subtitle fragment per line.
task.audio_file_path_absolute = "tts.wav"
task.text_file_path_absolute = "translated.txt"
task.sync_map_file_path_absolute = "output.srt"
ExecuteTask(task).execute()
task.output_sync_map_file()
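The alignment task reads `translated.txt`, which none of the earlier steps writes. With `is_text_type=plain`, Aeneas treats each line of that file as one fragment, so the translated text has to be split into lines first. The sketch below is one way to do that, assuming a Chinese target text as in the article's example. The function names are hypothetical, and the ASCII period is deliberately not treated as a sentence end so that decimals such as 3.14 stay intact.

```python
import re
from pathlib import Path

# One fragment = a run of non-terminator characters plus any trailing
# sentence-ending punctuation (full-width or ASCII "!" and "?").
_FRAGMENT = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def split_fragments(text: str) -> list:
    """Split text into subtitle-sized fragments, one per sentence."""
    return [m.strip() for m in _FRAGMENT.findall(text) if m.strip()]


def write_fragments(text: str, path: str = "translated.txt") -> int:
    """Write one fragment per line and return the number of fragments."""
    fragments = split_fragments(text)
    Path(path).write_text("\n".join(fragments) + "\n", encoding="utf-8")
    return len(fragments)
```

Calling `write_fragments(translated_text)` before the Aeneas task runs produces the file that `task.text_file_path_absolute` points at.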

Generating a Lip-Synced Talking-Head Video with SadTalker

python inference.py --driven_audio tts.wav --source_image headshot.png --result_dir ./result --still --preprocess full

Output: a video with lip-sync animation, for example result/headshot_animated.mp4
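The output file name above is only an example, and the actual name may differ between SadTalker versions. A downstream step can avoid hard-coding it by picking the most recently modified video under the result directory. This is a minimal sketch under that assumption; `newest_video` is a hypothetical helper, not a SadTalker function.

```python
from pathlib import Path


def newest_video(result_dir: str, suffix: str = ".mp4"):
    """Return the most recently modified video under result_dir, or None."""
    # Search recursively, since the generator may write into a subfolder.
    candidates = [p for p in Path(result_dir).rglob(f"*{suffix}") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path can then be passed to the FFmpeg step in place of a fixed file name.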

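Integrating AnimateDiff for Full-Body Animated Digital Humans

AnimateDiff is a diffusion-based motion generation system that can turn pose/control signals into an animated character.

Preparing a Motion Prompt (e.g. Dance or Gesture)

Use an open-source tool or create the .npz motion data yourself, or generate it directly with a tool such as T2M-GPT.

Running the generation with AnimateDiff:

python animate.py --text "你好，歡迎來到數字世界！" --motion motion_sequence.npz --output anim_frame_dir/ --tts_audio tts.wav

Output: a frame sequence, anim_frame_dir/*.png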

Producing the Final Audio and Video with Spring Boot + FFmpeg

Combining Speech + Video + Subtitles with FFmpeg

Compose the final video (lip-sync animation + synthesized speech + subtitles):

ffmpeg -i result/headshot_animated.mp4 -i tts.wav -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 final_lipsync.mp4
ffmpeg -i final_lipsync.mp4 -vf subtitles=output.srt final_output_with_subtitle.mp4
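When these two commands are driven from the Python side of the toolchain, building them as argument lists avoids shell quoting problems with paths that contain spaces. The sketch below mirrors the two commands above; the function names are hypothetical, and the `-y` flag (overwrite the output without prompting) is an addition for unattended runs.

```python
import subprocess


def mux_command(video: str, audio: str, output: str) -> list:
    """Take the video stream from `video` and the audio stream from `audio`."""
    return ["ffmpeg", "-y", "-i", video, "-i", audio,
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy", "-c:a", "aac", output]


def subtitle_command(video: str, srt: str, output: str) -> list:
    """Burn the SRT file into the picture (this re-encodes the video)."""
    return ["ffmpeg", "-y", "-i", video, "-vf", f"subtitles={srt}", output]


def run(command: list) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

For example, `run(mux_command("result/headshot_animated.mp4", "tts.wav", "final_lipsync.mp4"))` followed by `run(subtitle_command("final_lipsync.mp4", "output.srt", "final_output_with_subtitle.mp4"))` reproduces the two steps.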

Or assemble the frames produced by AnimateDiff:

ffmpeg -r 24 -i anim_frame_dir/%04d.png -i tts.wav -c:v libx264 -pix_fmt yuv420p final_fullbody.mp4
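At 24 frames per second the frame sequence only covers the narration if there are at least `duration × 24` frames. The sketch below computes that number from the audio length. It assumes `tts.wav` is an uncompressed PCM WAV file, which is what Python's standard `wave` module can read; the function names are hypothetical.

```python
import math
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_needed(duration_seconds: float, fps: int = 24) -> int:
    """Smallest number of video frames that covers the given duration."""
    return math.ceil(duration_seconds * fps)
```

For instance, `frames_needed(wav_duration_seconds("tts.wav"))` gives the minimum number of PNG files that `anim_frame_dir/` should contain for the video to last as long as the speech.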

Wrapping the FFmpeg Calls in Java (com.icoderoad.easydub.service)

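package com.icoderoad.easydub.service;

import org.springframework.stereotype.Service;
import java.io.IOException;

@Service
public class FFmpegService {

    // Take the video stream from videoPath and the audio stream from audioPath.
    public void combineVideoAudio(String videoPath, String audioPath, String outputPath) throws IOException, InterruptedException {
        run("ffmpeg", "-y", "-i", videoPath, "-i", audioPath, "-c:v", "copy", "-c:a", "aac",
                "-map", "0:v:0", "-map", "1:a:0", outputPath);
    }

    // Burn the subtitle file into the picture (this re-encodes the video).
    public void addSubtitle(String inputVideo, String subtitleFile, String outputVideo) throws IOException, InterruptedException {
        run("ffmpeg", "-y", "-i", inputVideo, "-vf", "subtitles=" + subtitleFile, outputVideo);
    }

    // Arguments are passed as a list so paths with spaces are not split; -y overwrites
    // existing output instead of waiting on a prompt; inheritIO keeps ffmpeg's log from blocking.
    private void run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("ffmpeg exited with code " + exitCode);
        }
    }
}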

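Advanced Features: Video Remixing and Multi-Track Mixing

Volume adjustment:

ffmpeg -i tts.wav -filter:a "volume=2.0" louder.wav

Mixing background music with narration:

ffmpeg -i background.wav -i tts.wav -filter_complex amix=inputs=2:duration=longest mixed.wav

The same logic can be wrapped in the Spring Boot service as well.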
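The two filters can also be combined so that the background music is turned down before it is mixed with the narration. The sketch below builds such a filter graph as a string; the default volume of 0.3 for the background and the function names are illustrative assumptions, not values from the article.

```python
def mix_filter(background_volume: float = 0.3, voice_volume: float = 1.0) -> str:
    """Filter graph: scale each input's volume, then mix the two tracks."""
    return (
        f"[0:a]volume={background_volume}[bg];"
        f"[1:a]volume={voice_volume}[voice];"
        "[bg][voice]amix=inputs=2:duration=longest[out]"
    )


def mix_command(background: str, voice: str, output: str,
                background_volume: float = 0.3) -> list:
    """ffmpeg argument list; input 0 is the music, input 1 is the narration."""
    return ["ffmpeg", "-y", "-i", background, "-i", voice,
            "-filter_complex", mix_filter(background_volume),
            "-map", "[out]", output]
```

`mix_command("background.wav", "tts.wav", "mixed.wav")` yields an argument list that can be passed to `subprocess.run(..., check=True)`.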

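Complete Flow (Overview)

Input video/audio
    ↓
UVR5 audio separation
    ↓
Whisper speech recognition → translation → XTTSv2 synthesis
    ↓                            ↓
Subtitle generation (Aeneas)     synthesized audio → SadTalker/AnimateDiff
    ↓                            ↓
 FFmpeg composition → final video output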
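The flow above can be expressed as an ordered list of stages, each declaring the inputs it needs and returning the outputs it produces. The sketch below shows only that wiring. The `Stage` class, the `run_pipeline` function, and the stub lambdas are hypothetical placeholders; in a real system each stub would call the corresponding tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    run: Callable[[Dict[str, str]], Dict[str, str]]
    needs: List[str] = field(default_factory=list)


def run_pipeline(stages: List[Stage], context: Dict[str, str]) -> Dict[str, str]:
    """Run stages in order, merging each stage's outputs into the context."""
    context = dict(context)  # do not mutate the caller's dictionary
    for stage in stages:
        missing = [key for key in stage.needs if key not in context]
        if missing:
            raise KeyError(f"{stage.name} is missing inputs: {missing}")
        context.update(stage.run(context))
    return context


# Stub stages that only return the file names used in this article.
stages = [
    Stage("separate", lambda c: {"vocals": "outputs/input_Vocals.wav"}, needs=["input"]),
    Stage("transcribe", lambda c: {"text": "<transcript>"}, needs=["vocals"]),
    Stage("translate", lambda c: {"translated": "<translation>"}, needs=["text"]),
    Stage("synthesize", lambda c: {"speech": "tts.wav"}, needs=["translated"]),
    Stage("align", lambda c: {"subtitles": "output.srt"}, needs=["speech", "translated"]),
    Stage("animate", lambda c: {"video": "result/headshot_animated.mp4"}, needs=["speech"]),
    Stage("compose", lambda c: {"final": "final_output_with_subtitle.mp4"},
          needs=["video", "speech", "subtitles"]),
]
```

Declaring `needs` per stage makes a missing intermediate file fail at the stage that requires it, rather than somewhere inside an FFmpeg call.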

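Recommendations for Choosing a Digital-Human Model

Model	Use case	Audio-driven?	Motion driving method
SadTalker	2D talking-head digital human	Yes	Audio-driven lip sync
AnimateDiff	3D full-body digital human	No	Motion + Prompt

For public-facing demos, we recommend combining the two: SadTalker handles close-up lip-sync detail, while AnimateDiff provides the full-body animation.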

Conclusion

Editor in charge: 武曉燕  Source: 路條編程

