LLaMA Factory in Practice: Efficiently Fine-Tuning Llama 3 on China's Domestic DCU

Published 2025-6-5 06:55

I. Preface

With the rapid development of large language models (LLMs), efficiently fine-tuning pretrained models for a specific domain or task has become a focus across the industry. LLaMA Factory, a capable and easy-to-use LLM fine-tuning framework, has drawn wide attention. This article walks through LoRA fine-tuning of a Llama 3 model with LLaMA Factory on a domestic DCU platform and shares the key steps and lessons learned.

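Hygon DCU hands-on project: large models and HPC development made easier

To help developers run large-model workloads (training, fine-tuning, inference) and scientific computing on Hygon DCU more conveniently, I have built a ready-to-use hands-on project, "dcu-in-action", with the support of the Hygon DCU developer community.

It aims to provide:

• Code examples and practice guides you can use right away

• A faster development and deployment workflow on Hygon DCU

Developers are welcome to:

• Visit the project's GitHub repository to try it out, contribute, and help improve it: https://github.com/FlyAIBox/dcu-in-action

• Give the project a Star if you find it helpful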

II. Environment Setup and LLaMA Factory Installation

This walkthrough runs on a domestic Hygon DCU K100-AI with DTK 25.04. The core software stack is Python 3.10 plus a DCU-optimized PyTorch build (torch==2.4.1+das.opt2.dtk2504) and the matching DCU-specific builds of related deep learning libraries (lmslim, flash-attn, vllm, deepspeed).

1. Create a virtual environment

conda create -n dcu_llm_fine python=3.10
conda activate dcu_llm_fine

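2. Install the DCU-specific deep learning libraries

Following the documentation, download the whl packages for PyTorch, lmslim, flash-attn, vllm, and deepspeed built for the DCU K100-AI (DTK 25.04, Python 3.10) from the Guanghe developer community and install them. Make sure every component's version matches exactly.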
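Because the versions have to line up exactly, it can help to print what actually ended up in the environment after installing the wheels. The following is a minimal sketch using only the Python standard library. The distribution names are taken from the package list above and are an assumption; the DCU wheels may register slightly different names.

```python
from importlib import metadata

# Distribution names from the package list above (assumed, may differ for DCU wheels).
PACKAGES = ["torch", "lmslim", "flash-attn", "vllm", "deepspeed"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report(packages=PACKAGES):
    """Print one line per package so mismatches are easy to spot."""
    for name, version in installed_versions(packages).items():
        print(f"{name:<12} {version or 'NOT INSTALLED'}")


if __name__ == "__main__":
    report()
```

Run it inside the dcu_llm_fine environment; for torch, the reported version should carry the DTK suffix mentioned earlier.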

3. Install LLaMA Factory

git clone http://developer.hpccube.com/codes/OpenDAS/llama-factory.git
cd llama-factory
pip install -e ".[torch,metrics]"

Note: if you run into package conflicts, try pip install --no-deps -e . instead.

III. Llama 3 LoRA Fine-Tuning in Practice

We use the Meta-Llama-3-8B-Instruct model as the example and apply LoRA (Low-Rank Adaptation) for supervised fine-tuning (SFT).

1. Fine-tuning configuration file explained (llama3_lora_sft.yaml)

The core configuration parameters are:

### model
model_name_or_path: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct  # model path
trust_remote_code: true

### method
stage: sft                      # fine-tuning stage: supervised fine-tuning
do_train: true
finetuning_type: lora           # fine-tuning method: LoRA
lora_rank: 8                    # LoRA rank
lora_target: all                # LoRA targets: all linear layers

### dataset
dataset: identity,alpaca_en_demo  # datasets to use
template: llama3                # chat template
cutoff_len: 2048                # sequence truncation length
max_samples: 1000               # max samples per dataset
overwrite_cache: true
preprocessing_num_workers: 16   # number of preprocessing processes

### output
output_dir: saves/llama3-8b/lora/sft  # output directory
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false          # save the full checkpoint, not just the model

### train
per_device_train_batch_size: 1  # batch size per device
gradient_accumulation_steps: 8  # gradient accumulation steps
learning_rate: 1.0e-4           # learning rate
num_train_epochs: 3.0           # number of training epochs
lr_scheduler_type: cosine       # learning-rate scheduler
warmup_ratio: 0.1               # warmup ratio
bf16: true                      # use bf16 mixed precision
ddp_timeout: 180000000
resume_from_checkpoint: null

2. Launch fine-tuning

llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

3. Key log output during fine-tuning, with interpretation

Environment initialization and distributed setup (log time: 21:16:40 - 21:16:51)

• Setting ds_accelerator to cuda (auto detect)

• Initializing 8 distributed tasks at: 127.0.0.1:54447

• Each GPU process (e.g. [PG 0 Rank 2]) initializes NCCL; the log shows size: 8, global rank: 2, TIMEOUT(ms): 180000000000, which is the configured ddp_timeout of 180000000 seconds expressed in milliseconds.

• Each process confirms its settings, e.g. Process rank: 2, world size: 8, device: cuda:2, distributed training: True, compute dtype: torch.bfloat16, which shows that bf16 mixed precision is enabled.

• Set ddp_find_unused_parameters to False in DDP training since LoRA is enabled.

Tokenizer and model configuration loading (log time: 21:16:51 - 21:16:52)

• Loads tokenizer.json, tokenizer.model, and related files.

• Loads the model configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct/config.json, confirming architecture details such as hidden_size: 4096, num_hidden_layers: 32, torch_dtype: "bfloat16".

Dataset loading and preprocessing (log time: 21:16:52 - 21:17:01)

• Loads the datasets identity.json (91 samples) and alpaca_en_demo.json (1000 samples).

• Converting format of dataset (num_proc=16) and Running tokenizer on dataset (num_proc=16), processing 1091 samples in total.

• Prints one processed training example, including input_ids, inputs (with the chat template applied), and label_ids (the prompt portion is set to -100).
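The label_ids in that sample follow the usual SFT convention of masking prompt tokens so that only the response contributes to the loss. The idea can be sketched as follows, assuming -100 as the ignore index (the value the log shows) and using made-up token ids rather than real Llama 3 tokenizer output:

```python
IGNORE_INDEX = -100  # value shown for prompt positions in the logged label_ids


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt part of the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    label_ids = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, label_ids


# Hypothetical token ids, for illustration only.
prompt = [101, 7, 8, 9]
response = [42, 43, 2]
input_ids, label_ids = build_labels(prompt, response)
print(input_ids)
print(label_ids)
```

Both lists have the same length; positions holding -100 are skipped by the cross-entropy loss, so the model is only trained to produce the response tokens.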

Base model weight loading and LoRA adapter setup (log time: 21:17:01 - 21:17:16)

• KV cache is disabled during training.

• Loads the model weights via /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct/model.safetensors.index.json, 4 shards in total.

• A warning appears: Using the SDPA attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.

• Gradient checkpointing enabled.

• Fine-tuning method: LoRA

• Found linear modules: v_proj,q_proj,k_proj,down_proj,o_proj,gate_proj,up_proj (these are the layers selected by lora_target: all).

• trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605, which states the number and share of trainable parameters that LoRA introduces.
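The trainable parameter count can be reproduced by hand: a rank-r adapter on a linear layer with shape (in, out) adds r * (in + out) parameters. The sketch below applies this to the seven modules listed in the log. hidden_size and num_hidden_layers come from the log; the intermediate size (14336) and the key/value projection width (1024, i.e. 8 KV heads of dimension 128) are assumptions taken from the public Meta-Llama-3-8B-Instruct config.

```python
HIDDEN = 4096          # hidden_size, from the log
NUM_LAYERS = 32        # num_hidden_layers, from the log
KV_DIM = 1024          # assumed: 8 KV heads x head_dim 128
INTERMEDIATE = 14336   # assumed: intermediate_size of Llama 3 8B
LORA_RANK = 8          # lora_rank in the YAML

# (in_features, out_features) for each linear module found by lora_target: all
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in MODULES.values())
trainable = per_layer * NUM_LAYERS
all_params = 8_051_232_768  # "all params" from the log

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Under these assumptions the result agrees with the logged 20,971,520 trainable parameters and 0.2605%.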

Trainer initialization and training loop (log time: 21:17:16 - 21:22:15)

• ***** Running training *****

• Num examples = 1,091, Num Epochs = 3

• Instantaneous batch size per device = 1, Total train batch size (w. parallel, distributed & accumulation) = 64

• Gradient Accumulation steps = 8, Total optimization steps = 51

• Training logs are emitted periodically (every logging_steps: 10; the unit is the optimizer step, each of which aggregates the accumulated micro-batches):

{'loss': 1.4091, 'grad_norm': 1.0385..., 'learning_rate': 9.8063...e-05, 'epoch': 0.58}

{'loss': 1.0404, 'grad_norm': 0.6730..., 'learning_rate': 7.7959...e-05, 'epoch': 1.17}

{'loss': 0.9658, 'grad_norm': 0.4174..., 'learning_rate': 4.4773...e-05, 'epoch': 1.75}

{'loss': 0.9389, 'grad_norm': 0.3942..., 'learning_rate': 1.4033...e-05, 'epoch': 2.34}

{'loss': 0.894, 'grad_norm': 0.4427..., 'learning_rate': 1.2179...e-07, 'epoch': 2.92}

• The following warning appears repeatedly during training: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
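The learning_rate values in these log lines are what a cosine schedule with linear warmup would produce for the configured learning_rate, warmup_ratio, and 51 total steps. The sketch below recomputes them with plain Python. It is not the trainer's own code, and it assumes that the number of warmup steps is the warmup ratio times the total step count, rounded up.

```python
import math

BASE_LR = 1.0e-4     # learning_rate in the YAML
TOTAL_STEPS = 51     # "Total optimization steps" from the log
WARMUP_RATIO = 0.1   # warmup_ratio in the YAML
# Assumption: warmup steps = ceil(ratio * total steps).
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Linear warmup, then half-cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (10, 20, 30, 40, 50):
    print(step, f"{lr_at(step):.6e}")
```

Under that assumption the computed values agree with the logged learning rates at steps 10 through 50 to the digits shown in the log.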

Training completion and model saving (log time: 21:22:15 - 21:22:17)

• Saving model checkpoint to saves/llama3-8b/lora/sft/checkpoint-51

• Final training metrics ***** train metrics *****:

epoch = 2.9781

train_loss = 1.0481

train_runtime = 0:04:56.32 (i.e. 296.3281 seconds)

train_samples_per_second = 11.045

train_steps_per_second = 0.172

• Figure saved at: saves/llama3-8b/lora/sft/training_loss.png

• The NCCL communicators shut down and each process cleans up its resources.
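The batch size, step count, final epoch value, and throughput reported in the training log can be cross-checked with a little arithmetic. The sketch below does so under two assumptions: each of the 8 ranks receives ceil(1091 / 8) micro-batches per epoch (the distributed sampler pads the last shard), and a partial accumulation window at the end of an epoch does not count as an extra optimizer step.

```python
import math

NUM_EXAMPLES = 1091        # Num examples
WORLD_SIZE = 8             # distributed tasks
PER_DEVICE_BATCH = 1       # per_device_train_batch_size
GRAD_ACCUM = 8             # gradient_accumulation_steps
NUM_EPOCHS = 3             # num_train_epochs
TRAIN_RUNTIME_S = 296.3281 # train_runtime in seconds

total_batch = PER_DEVICE_BATCH * WORLD_SIZE * GRAD_ACCUM

# Micro-batches each rank processes per epoch (assumed padded to equal length).
batches_per_rank = math.ceil(NUM_EXAMPLES / (WORLD_SIZE * PER_DEVICE_BATCH))
steps_per_epoch = batches_per_rank // GRAD_ACCUM
total_steps = steps_per_epoch * NUM_EPOCHS

# Fraction of the data actually consumed by the time training stops.
final_epoch = total_steps * GRAD_ACCUM / batches_per_rank

samples_per_second = NUM_EXAMPLES * NUM_EPOCHS / TRAIN_RUNTIME_S
steps_per_second = total_steps / TRAIN_RUNTIME_S

print(total_batch, total_steps, round(final_epoch, 4))
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

This also explains why the final epoch is 2.9781 rather than 3.0: 17 optimizer steps per epoch consume 136 of the 137 micro-batches on each rank.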

IV. Model Inference Test

After fine-tuning completes, we load the LoRA adapter and run an inference test.

1. Inference configuration file (llama3_lora_sft.yaml for inference)

model_name_or_path: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft  # load the fine-tuned LoRA adapter
template: llama3
infer_backend: huggingface  # inference backend
trust_remote_code: true

2. Launch inference

llamafactory-cli chat examples/inference/llama3_lora_sft.yaml

3. Key log output during inference and test results

Model loading (log time: 17:30:16 - 17:31:18)

• Loads the base model's tokenizer, config (torch_dtype: "bfloat16", use_cache: true), and weights (model.safetensors.index.json, 4 shards).

• KV cache is enabled for faster generation.

• The SDPA on ROCm performance warning appears again.

• Loads the LoRA adapter: Loaded adapter(s): saves/llama3-8b/lora/sft.

• Merged 1 adapter(s)., confirming that the LoRA weights have been merged into the base model.

• Parameter count after loading: all params: 8,030,261,248.
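Merging folds each low-rank update into its base weight, W' = W + s * (B A), so the separate adapter matrices disappear. That is consistent with the parameter count dropping from 8,051,232,768 during training to 8,030,261,248 here, a difference of exactly the 20,971,520 LoRA parameters. A toy sketch of the merge in plain Python follows; the matrices and the scaling factor are made up for illustration and are not values from this run.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scaling):
    """Return W + scaling * (B @ A), the merged weight matrix."""
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x3 weight with a rank-1 adapter; all numbers are hypothetical.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # rank x in_features
B = [[0.5], [1.0]]      # out_features x rank
SCALING = 2.0           # hypothetical stand-in for lora_alpha / lora_rank

merged = merge_lora(W, A, B, SCALING)

# The merged weight gives the same output as base path + adapter path.
x = [[1.0], [1.0], [1.0]]
separate = [[base[0] + SCALING * lora[0]]
            for base, lora in zip(matmul(W, x), matmul(B, matmul(A, x)))]
assert matmul(merged, x) == separate

# Parameter bookkeeping from the two logs.
print(8_051_232_768 - 20_971_520)
```

Because the merged matrix has the same shape as W, inference after merging costs no more than running the base model.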

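Interactive test result

• User:

Who are you?

A:

I am {{name}}, an AI assistant trained by {{author}}. I am here to help you, answer questions, and complete tasks.

Analysis: the {{name}} and {{author}} placeholders in the output show that the model picked up the template format of the fine-tuning data identity.json. The placeholders were evidently not replaced with real values before training, so the model reproduces them literally.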
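To have the model answer with a real name instead, the placeholders would need to be filled in before training. A minimal sketch of that substitution is shown below. The record layout (instruction / input / output fields) is an assumption about identity.json, and the name and author values are hypothetical.

```python
import json


def fill_placeholders(records, name, author):
    """Replace {{name}} and {{author}} in every string field of each record."""
    filled = []
    for record in records:
        filled.append({
            key: (value.replace("{{name}}", name).replace("{{author}}", author)
                  if isinstance(value, str) else value)
            for key, value in record.items()
        })
    return filled


# In-memory sample shaped like one identity record (field names assumed).
sample = [{
    "instruction": "Who are you?",
    "input": "",
    "output": "I am {{name}}, an AI assistant trained by {{author}}.",
}]

result = fill_placeholders(sample, name="DCU-Bot", author="DCU Lab")
print(json.dumps(result, ensure_ascii=False, indent=2))
```

To apply this to the real dataset, load the file with json.load, pass the records through fill_placeholders, and write them back with json.dump before launching training.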

V. Model Export

Merge the fine-tuned LoRA weights into the base model and export the result as a standalone model.

1. Export configuration file (llama3_lora_sft.yaml for export)

### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
model_name_or_path: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
template: llama3
trust_remote_code: true

### export
export_dir: output/llama3_lora_sft  # export directory
export_size: 5                     # max size of each model shard (GB)
export_device: cpu                 # device used during export
export_legacy_format: false        # skip the legacy format, prefer safetensors

Important: as the configuration file states, do not use a quantized model when merging LoRA adapters.

2. Launch export

llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml

3. Key log output during export (log time: 18:06:54 - 18:08:22)

• Loads the base model's tokenizer, config (torch_dtype: "bfloat16"), and weights (4 shards).

• Loads the LoRA adapter: Loaded adapter(s): saves/llama3-8b/lora/sft.

• Merged 1 adapter(s)., merging the LoRA weights into the base model.

• Convert model dtype to: torch.bfloat16.

• Configuration files saved: Configuration saved in output/llama3_lora_sft/config.json and output/llama3_lora_sft/generation_config.json.

• Model weights saved: The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at output/llama3_lora_sft/model.safetensors.index.json. (This follows from export_size: 5 in the configuration.)

• Tokenizer files saved: tokenizer config file saved in output/llama3_lora_sft/tokenizer_config.json and special_tokens_map.json.

• Extra feature: Ollama modelfile saved in output/llama3_lora_sft/Modelfile.
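The 4-shard split is what a back-of-the-envelope estimate predicts for a bf16 model of this size with a 5 GB limit. The sketch below computes a lower bound on the shard count. It assumes 2 bytes per parameter and decimal gigabytes (10^9 bytes); the actual sharding assigns whole tensors to shards, so the real count can be higher than this bound.

```python
import math

MERGED_PARAMS = 8_030_261_248  # "all params" after merging, from the inference log
BYTES_PER_PARAM = 2            # bfloat16
EXPORT_SIZE_GB = 5             # export_size in the YAML
GB = 10**9                     # assumption: decimal gigabytes

total_bytes = MERGED_PARAMS * BYTES_PER_PARAM
min_shards = math.ceil(total_bytes / (EXPORT_SIZE_GB * GB))

print(f"{total_bytes / GB:.2f} GB -> at least {min_shards} shards")
```

For this run the bound matches the 4 checkpoint shards reported in the export log.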

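VI. Summary and Outlook

This walkthrough covered the full workflow of LoRA fine-tuning, inference, and export for a Llama 3 model with LLaMA Factory on a domestic DCU platform. With its clear configuration and convenient command-line tools, LLaMA Factory significantly lowers the barrier to LLM fine-tuning. Reading the key log output and test results at each stage gives a more direct view of the model's learning dynamics during training, its behavior at inference, and its structure after export.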

This article is reposted from 螢火AI百寶箱; author: 螢火AI百寶箱
