LeCun Retweets: Running LLaMA on Apple M1/M2 Chips! The 13-Billion-Parameter Model Needs Only 4GB of Memory

Author: 新智元 2023-03-13 15:42:00


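Meta's latest large language model, LLaMA, can now run on Macs with Apple silicon!

Not long ago, Meta released its open-source large language model LLaMA, and almost immediately someone posted a download link that bypassed the access form, so the model was "forcibly" opened to everyone.

The news set the community buzzing, and people rushed to download and test the model.

Those without a top-end graphics card, however, could only look on.

That turns out not to be a big problem. Georgi Gerganov recently created a project called "llama.cpp" that runs LLaMA without a GPU.

Project URL: https://github.com/ggerganov/llama.cpp

Yes, that includes Macs with Apple silicon. The project also got a retweet from LeCun.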

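Running LLaMA on an M1/M2 Mac

At the moment there are two fairly complete tutorials, one based on Apple's M1 processor and the other on the M2.

First tutorial: https://dev.l1x.be/posts/2023/03/12/using-llama-with-m1-mac/

Second tutorial: https://til.simonwillison.net/llms/llama-7b-m2

Without further ado, here are the results.

For example, the smallest model, 7B, was given the prompt "The first man on the moon was".

A few seconds later, the result came back.

-p 'The first man on the moon was'

The first man on the moon was 38-year-old astronaut Neil A. Armstrong.

Apollo 11 landed on the moon on July 20, 1969.

Apollo 11 was the first crewed mission to land on the Moon and the climax of the space race. In July 1969, Neil Armstrong and Edwin "Buzz" Aldrin became the first humans to land on the Moon.

Apollo 11 launched from the Kennedy Space Center on July 16.

Armstrong's age at the time of the landing, his middle initial, and the dates contain no obvious factual errors, which is impressive for a model this small.

With the following prompt, the model generates some practical Python code.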

-p 'def open_and_return_content(filename):'

def open_and_return_content(filename):
    """
    Opens file (returning the content) and performs basic sanity checks
    """
    if os.path.isfile(filename):
        with open(filename) as f:
            content = f.read()
            return content
    else:
        print('WARNING: file "{}" does not exist'.format(filename), file=sys.stderr)
        return ''

def get_file_info(filename, fullpath):
    """
    Get file information (i.e., permission, owner, group, size)
    """

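Next, let's look at how this is done, step by step.

Step 1: Download the model

The first thing to do is download the LLaMA model.

You can apply to Meta through the official form, or get the files directly from a link shared online.

Either way, once the download finishes you will have the model files on disk.

Each model lives in its own folder, and each folder contains a params.json file with details about that model.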
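
As a quick way to inspect one of these files, here is a minimal Python sketch. It assumes the folder layout used later in this article (for example models/7B) and the key names dim, n_layers and n_heads, which are taken from the converter output shown in Step 4. The helper names load_params and describe are made up for illustration.

```python
import json
from pathlib import Path


def load_params(model_dir):
    """Read params.json from a model folder and return its contents as a dict."""
    with open(Path(model_dir) / "params.json", encoding="utf-8") as f:
        return json.load(f)


def describe(params):
    """Format a one-line summary; the key names are assumed from the Step 4 log."""
    return "dim={dim}, layers={n_layers}, heads={n_heads}".format(**params)


# Example usage (assumes the 7B model was placed under models/7B):
# print(describe(load_params("models/7B")))
```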

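Step 2: Install dependencies

First, you need Xcode to compile the C++ project. The command below installs its command line tools.

xcode-select --install

Next come the dependencies for building the C++ project (pkgconfig and cmake).

brew install pkgconfig cmake

For the environment, assuming you use Python 3.11, you can create a virtual environment:

/opt/homebrew/bin/python3.11 -m venv venv

Then activate the venv. (For shells other than fish, just drop the .fish suffix.)

. venv/bin/activate.fish

Finally, install Torch.

pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu

If you are interested in GPU-accelerated training with the new Metal Performance Shaders (MPS) backend, you can check that it is available by running the following. This is not required for running LLaMA on an M1, though.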

python
Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True
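
The same check can be wrapped in a small script. This is only a sketch: the function name pick_device is made up for illustration, and the fallback to "cpu" when Torch is missing or MPS is unavailable is an assumption about what a caller would want, not something the tutorials prescribe.

```python
def pick_device():
    """Return "mps" if PyTorch reports the Metal backend as available, else "cpu"."""
    try:
        import torch
    except ImportError:
        # Torch is not installed; nothing to accelerate.
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on older Torch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```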

Step 3: Compile llama.cpp

git clone git@github.com:ggerganov/llama.cpp.git

Once all the dependencies are installed, you can run make inside the cloned llama.cpp directory:

make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)
cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main  -framework Accelerate
./main -h
usage: ./main [options]
options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)  
  -t N, --threads N     number of threads to use during computation (default: 4)  
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)  
  -n N, --n_predict N   number of tokens to predict (default: 128)  
  --top_k N             top-k sampling (default: 40)  
  --top_p N             top-p sampling (default: 0.9)  
  --temp N              temperature (default: 0.8)  
  -b N, --batch_size N  batch size for prompt processing (default: 8)  
  -m FNAME, --model FNAME
                        model path (default: models/llama-7B/ggml-model.bin)
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize  -framework Accelerate
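
The help text above lists three sampling options: --top_k, --top_p and --temp. The Python sketch below shows one common way these three ideas are combined when picking the next token. It is a generic illustration, not llama.cpp's actual sampler; the function name sample_token and the order of the steps are assumptions, and only the default values (40, 0.9, 0.8) come from the help output.

```python
import math
import random


def sample_token(logits, temp=0.8, top_k=40, top_p=0.9, rng=random):
    """Pick a token index using temperature, top-k and top-p (nucleus) filtering."""
    # Temperature: dividing by temp < 1 sharpens the distribution.
    scaled = sorted(((l / temp, i) for i, l in enumerate(logits)), reverse=True)
    # Top-k: keep only the k highest-scoring candidates.
    scaled = scaled[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[0][0]
    exps = [math.exp(l - peak) for l, _ in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, (_, idx) in zip(probs, scaled):
        kept.append((p, idx))
        cumulative += p
        if cumulative >= top_p:
            break
    # Draw from the remaining candidates in proportion to their probability.
    r = rng.random() * sum(p for p, _ in kept)
    acc = 0.0
    for p, idx in kept:
        acc += p
        if r < acc:
            return idx
    return kept[-1][1]


rng = random.Random(0)
print(sample_token([2.0, 1.0, 0.5, -1.0], rng=rng))
```

Lower values of any of the three parameters make the output more predictable; with top_k=1 the function always returns the highest-scoring token.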

Step 4: Convert the model

This assumes you have already placed the models under models/ in the llama.cpp repo.

python convert-pth-to-ggml.py models/7B 1

You should then see output like this:

{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts =  1
Processing part  0
Processing variable: tok_embeddings.weight with shape:  torch.Size([32000, 4096])  and type:  torch.float16
Processing variable: norm.weight with shape:  torch.Size([4096])  and type:  torch.float16
  Converting to float32
Processing variable: output.weight with shape:  torch.Size([32000, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wq.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wk.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wv.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wo.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
Processing variable: layers.0.feed_forward.w1.weight with shape:  torch.Size([11008, 4096])  and type:  torch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape:  torch.Size([4096, 11008])  and type:  torch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape:  torch.Size([11008, 4096])  and type:  torch.float16
Processing variable: layers.0.attention_norm.weight with shape:  torch.Size([4096])  and type:  torch.float16
...
Done. Output file: models/7B/ggml-model-f16.bin, (part  0 )
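
The tensor shapes in this log are enough to estimate how many parameters the 7B model has. The sketch below adds them up, assuming that every one of the 32 layers has the same shapes as layer 0 and that each layer also has a second norm vector of size dim (the quantize log in the next step mentions an ffn_norm weight). The function name llama_param_count is made up for illustration.

```python
def llama_param_count(dim, n_layers, vocab_size, n_ff):
    """Sum the tensor sizes listed in the conversion log."""
    embeddings = vocab_size * dim        # tok_embeddings.weight
    output = vocab_size * dim            # output.weight
    attention = 4 * dim * dim            # wq, wk, wv, wo
    feed_forward = 3 * dim * n_ff        # w1, w2, w3
    norms = 2 * dim                      # attention_norm, ffn_norm (assumed)
    per_layer = attention + feed_forward + norms
    final_norm = dim                     # norm.weight
    return embeddings + output + n_layers * per_layer + final_norm


total = llama_param_count(dim=4096, n_layers=32, vocab_size=32000, n_ff=11008)
print(total)                          # 6738415616
print(round(total / 1e9, 2), "billion")
```

Under these assumptions the total comes to roughly 6.7 billion, which is where the "7B" label comes from.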

The next step is quantization:

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

The output looks like this:

llama_model_quantize: loading model from './models/7B/ggml-model-f16.bin'
llama_model_quantize: n_vocab = 32000
llama_model_quantize: n_ctx   = 512
llama_model_quantize: n_embd  = 4096
llama_model_quantize: n_mult  = 256
llama_model_quantize: n_head  = 32
llama_model_quantize: n_layer = 32
llama_model_quantize: f16     = 1
...
layers.31.attention_norm.weight - [ 4096,     1], type =    f32 size =    0.016 MB
layers.31.ffn_norm.weight - [ 4096,     1], type =    f32 size =    0.016 MB
llama_model_quantize: model size  = 25705.02 MB
llama_model_quantize: quant size  =  4017.27 MB
llama_model_quantize: hist: 0.000 0.022 0.019 0.033 0.053 0.078 0.104 0.125 0.134 0.125 0.104 0.078 0.053 0.033 0.019 0.022


main: quantize time = 29389.45 ms
main:    total time = 29389.45 ms
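
To give a feel for what 4-bit quantization does, here is a simplified Python sketch of symmetric block quantization. It is an illustration only and not llama.cpp's implementation: the block size of 32 and the single 32-bit scale per block are assumptions. The 16 buckets in the hist line above are at least consistent with 4-bit codes, which can take 16 distinct values.

```python
import random

BLOCK_SIZE = 32   # assumed number of weights per block
SCALE_BYTES = 4   # assumed: one 32-bit float scale stored per block


def quantize_block(values):
    """Map a block of floats to integer codes in [-8, 7] plus one shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the scale and the 4-bit codes."""
    return [scale * c for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE, scale_bytes=SCALE_BYTES):
    """Storage cost per weight: 4 bits for the code plus a share of the scale."""
    return (scale_bytes * 8 + block_size * 4) / block_size


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("max reconstruction error:", max_err)

# Ratio of the two sizes reported in the log above.
logged_ratio = 25705.02 / 4017.27
print("bits per weight (assumed layout):", bits_per_weight())
print("32 / bits per weight:", 32 / bits_per_weight())
print("logged size ratio:", round(logged_ratio, 2))
```

Under the assumed layout each weight costs 5 bits, so the expected shrink relative to 32-bit weights is 6.4x. The ratio of the two sizes in the log is also roughly 6.4, which fits that reading if the reported model size counts 32 bits per weight; that last point is an inference from the numbers, not something the log states.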

Step 5: Run the model

./main -m ./models/7B/ggml-model-q4_0.bin \
        -t 8 \
        -n 128 \
        -p 'The first president of the USA was '
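
If you would rather drive the binary from a script, the same invocation can be assembled in Python. This is a minimal sketch: the helper name build_main_command is made up, and the flags are only the ones that appear in the command above and in the ./main -h output.

```python
import shlex


def build_main_command(model_path, prompt, threads=8, n_predict=128, binary="./main"):
    """Return the argument list for one llama.cpp run (flags as shown by ./main -h)."""
    return [
        binary,
        "-m", model_path,
        "-t", str(threads),
        "-n", str(n_predict),
        "-p", prompt,
    ]


cmd = build_main_command(
    "./models/7B/ggml-model-q4_0.bin",
    "The first president of the USA was ",
)
print(shlex.join(cmd))

# To actually run it from the llama.cpp directory:
# import subprocess
# result = subprocess.run(cmd, capture_output=True, text=True)
# print(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain quotes or spaces.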

main: seed = 1678615879
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
main: prompt: 'The first president of the USA was '
main: number of tokens in prompt = 9
     1 -> ''
  1576 -> 'The'
   937 -> ' first'
  6673 -> ' president'
   310 -> ' of'
   278 -> ' the'
  8278 -> ' USA'
   471 -> ' was'
 29871 -> ' '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.


main: mem per token = 14434244 bytes
main:     load time =  1311.74 ms
main:   sample time =   278.96 ms
main:  predict time =  7375.89 ms / 54.23 ms per token
main:    total time =  9216.61 ms
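
A few of the numbers in this log can be cross-checked with simple arithmetic. The sketch below is one reading that happens to match the logged figures, not something taken from the llama.cpp source: it assumes that n_mem is layers times context length, and that memory_size holds one key and one value vector per slot, stored as 32-bit floats.

```python
# Values copied from the log above.
n_layer, n_ctx, n_embd = 32, 512, 4096
predict_ms, ms_per_token = 7375.89, 54.23

# Assumed: one cache slot per layer per context position.
n_mem = n_layer * n_ctx
print("n_mem:", n_mem)  # 16384

# Assumed: keys and values (factor 2), n_embd floats each, 4 bytes per float.
memory_mb = 2 * n_mem * n_embd * 4 / (1024 * 1024)
print("memory_size MB:", memory_mb)  # 512.0

# Throughput implied by the timing lines.
tokens_evaluated = round(predict_ms / ms_per_token)
tokens_per_second = 1000.0 / ms_per_token
print("tokens evaluated:", tokens_evaluated)  # 136
print("tokens per second: %.1f" % tokens_per_second)  # 18.4
```

The 136 evaluated tokens line up with the 9 prompt tokens plus the 128 requested with -n, minus one, and the speed works out to about 18 tokens per second on 8 threads.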

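Resource usage

The author of the second tutorial reports that, while running, the 13B model used about 4GB of memory and 748% CPU. (The model was configured to use 8 CPU cores.)

No instruction tuning

One of the key reasons GPT-3 and ChatGPT work so well is that both have been instruction-tuned. This additional training lets them respond effectively to human instructions such as "Summarize this", "Write a poem about an otter", or "Extract the key points from this article".

The blogger who wrote the tutorial says that, as far as he can tell, LLaMA has no such ability.

In other words, prompts for LLaMA need to take the classic form: "some text that will be completed by...". This also makes prompt engineering harder.

For example, the blogger has yet to come up with a prompt that gets LLaMA to summarize a text.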

Editor: 張燕妮. Source: 新智元
