From Simple Counting to Multimodal: The Evolution and Application of Embedding Techniques

Published 2025-4-30 10:43


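In the vast universe of natural language processing (NLP), embeddings are among the brightest stars. From early count-based methods to today's multimodal deep learning models, each step in their evolution has pushed NLP forward and changed how we represent and process language. Let's take a tour through the 14 methods that defined that evolution.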


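I. First Steps: Basic Embedding Methods

(1) Count Vectorizer

In NLP, everything starts with counting. A count vectorizer is a simple, intuitive way to turn text into vectors: it counts how many times each vocabulary word occurs in a document and uses those counts as the document's vector.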

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example documents
documents = ["cat sits here", "dog barks loud", "cat barks loud"]

# Initialize CountVectorizer. The default stores raw counts;
# binary=True would record only presence or absence.
vectorizer = CountVectorizer()

# Fit the vocabulary and transform the documents
X = vectorizer.fit_transform(documents)

# Get the feature names (unique words, sorted alphabetically)
feature_names = vectorizer.get_feature_names_out()

# Convert to a DataFrame for easier inspection
df = pd.DataFrame(X.toarray(), columns=feature_names)

# Print the count matrix
print(df)

Output:

   barks  cat  dog  here  loud  sits
0      0    1    0     1     0     1
1      1    0    1     0     1     0
2      1    1    0     0     1     0

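The appeal of this method is that it is easy to understand and trivial to implement. Its drawbacks are just as clear: the vector dimension grows with the vocabulary, and the counts capture nothing about the semantic relationships between words. For simple tasks such as spam detection, though, a count vectorizer can still do useful work.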
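To make the counting step concrete, here is a minimal sketch in plain Python of what a count vectorizer does. It assumes lowercasing and whitespace tokenization, which is cruder than scikit-learn's tokenizer, and the helper name `count_vectors` is made up for illustration. The last document repeats a word so that a count above 1 shows up.

```python
from collections import Counter


def count_vectors(documents):
    """Return (sorted vocabulary, one count vector per document).

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per vocabulary term, in vocabulary order
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = count_vectors(
    ["cat sits here", "dog barks loud", "cat barks loud loud"]
)
print(vocabulary)  # ['barks', 'cat', 'dog', 'here', 'loud', 'sits']
for vector in vectors:
    print(vector)
# [0, 1, 0, 1, 0, 1]
# [1, 0, 1, 0, 1, 0]
# [1, 1, 0, 0, 2, 0]
```

Every new word adds a dimension to every vector, which is the growth problem described above.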

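(2) One-Hot Encoding

One-hot encoding is another basic method. Each word is represented by a one-hot vector: the dimension that corresponds to the word in the vocabulary is 1, and every other dimension is 0.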

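from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Example words
words = ["cat", "dog", "barks", "loud"]

# Initialize the encoder; sparse_output=False returns a dense array
# (scikit-learn >= 1.2; older releases used the `sparse` argument instead)
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform (the encoder expects one row per sample)
encoded_words = encoder.fit_transform([[word] for word in words])

# Convert to a DataFrame for easier inspection
df_onehot = pd.DataFrame(encoded_words, columns=encoder.get_feature_names_out(['word']))

# Print the one-hot matrix
print(df_onehot)

Output (categories are sorted alphabetically, so the columns do not follow input order):

   word_barks  word_cat  word_dog  word_loud
0         0.0       1.0       0.0        0.0
1         0.0       0.0       1.0        0.0
2         1.0       0.0       0.0        0.0
3         0.0       0.0       0.0        1.0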


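One-hot encoding is unambiguous: every word gets its own representation, with no overlap. Its main problem is inefficiency. With a large vocabulary the vectors become extremely sparse, and because every pair of distinct words is equally far apart, the encoding cannot express semantic similarity.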
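The similarity problem is easy to check numerically. The sketch below is an illustration, not part of the original example: the toy vocabulary and the `cosine` helper are assumptions. It builds one-hot vectors from an identity matrix and shows that two different words always have cosine similarity 0, however related they are.

```python
import numpy as np

# Toy vocabulary (an assumption for this illustration)
vocabulary = ["barks", "cat", "dog", "loud"]

# Row i of the identity matrix is the one-hot vector for vocabulary[i]
one_hot = np.eye(len(vocabulary))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cat = one_hot[vocabulary.index("cat")]
dog = one_hot[vocabulary.index("dog")]

print(cosine(cat, dog))  # 0.0: distinct words are always orthogonal
print(cosine(cat, cat))  # 1.0: a word is only similar to itself
```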

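(3) TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a classic text representation. It dates back to the 1970s and remains a cornerstone of information retrieval systems and text mining applications. It weights each word by combining its frequency within a document (TF) with its inverse document frequency across the corpus (IDF). The final TF-IDF score is the product of the two.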

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example documents
documents = ["cat sits here", "dog barks loud", "cat barks loud"]

# Initialize TfidfVectorizer (defaults: smoothed IDF, L2-normalized rows)
vectorizer = TfidfVectorizer()

# Fit the vocabulary and transform the documents
X = vectorizer.fit_transform(documents)

# Get the feature names (unique words, sorted alphabetically)
feature_names = vectorizer.get_feature_names_out()

# Convert to a DataFrame for easier inspection
df_tfidf = pd.DataFrame(X.toarray(), columns=feature_names)

# Print the TF-IDF matrix, rounded to two decimals
print(df_tfidf.round(2))

Output:

   barks   cat   dog  here  loud  sits
0   0.00  0.47  0.00  0.62  0.00  0.62
1   0.52  0.00  0.68  0.00  0.52  0.00
2   0.58  0.58  0.00  0.00  0.58  0.00

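TF-IDF's strength is that it highlights distinctive words by down-weighting terms that appear in many documents. It does not reduce dimensionality, however: the vector still has one dimension per vocabulary term and remains sparse despite the weighting. It also ignores word order and deeper semantic relationships.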
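The weighting can be reproduced by hand. The following sketch recomputes TF-IDF for the same three documents in plain Python, assuming the smoothed IDF `ln((1 + N) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn documents its default `TfidfVectorizer` behavior. The helper name `tfidf_vector` is illustrative.

```python
import math
from collections import Counter

documents = ["cat sits here", "dog barks loud", "cat barks loud"]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(documents)

# Document frequency: in how many documents does each term appear?
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed inverse document frequency (assumed to match scikit-learn's default)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}


def tfidf_vector(tokens):
    """Raw count times IDF per term, then scale the vector to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf = [tfidf_vector(tokens) for tokens in tokenized]
for row in tfidf:
    print([round(value, 2) for value in row])
# [0.0, 0.47, 0.0, 0.62, 0.0, 0.62]
# [0.52, 0.0, 0.68, 0.0, 0.52, 0.0]
# [0.58, 0.58, 0.0, 0.0, 0.58, 0.0]
```

In the first document, "here" and "sits" (each found in one document) outweigh "cat" (found in two), which is the down-weighting of common terms at work.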


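II. Moving Up: Statistical and Machine-Learning Embedding Methods

(4) Okapi BM25

Okapi BM25 is a probabilistic ranking function used mainly for document ranking in information retrieval systems. It refines TF-IDF by accounting for document length normalization and term-frequency saturation (the diminishing return from repeating a word). Two parameters control this: k1 sets how quickly term frequency saturates, and b sets how strongly scores are normalized by document length.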

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example documents
documents = ["cat sits here", "dog barks loud", "cat barks loud"]

# Compute term frequencies (TF)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
tf_matrix = X.toarray()
feature_names = vectorizer.get_feature_names_out()

# Compute inverse document frequencies (IDF)
N = len(documents)  # number of documents
doc_freq = np.sum(tf_matrix > 0, axis=0)  # document frequency of each term
idf = np.log((N - doc_freq + 0.5) / (doc_freq + 0.5) + 1)  # BM25 IDF (the "+ 1" variant keeps it positive)

# Compute BM25 scores
k1 = 1.5  # term-frequency saturation parameter
b = 0.75  # length normalization parameter
doc_lengths = np.array([len(doc.split()) for doc in documents])
avgdl = doc_lengths.mean()  # average document length
bm25_matrix = np.zeros_like(tf_matrix, dtype=np.float64)

for i in range(N):  # loop over documents
    for j in range(len(feature_names)):  # loop over terms
        term_freq = tf_matrix[i, j]
        num = term_freq * (k1 + 1)
        denom = term_freq + k1 * (1 - b + b * (doc_lengths[i] / avgdl))
        bm25_matrix[i, j] = idf[j] * (num / denom)

# Convert to a DataFrame for easier inspection
df_bm25 = pd.DataFrame(bm25_matrix, columns=feature_names)

# Print the BM25 score matrix, rounded to two decimals
print(df_bm25.round(2))

Output (every document has the average length and every term occurs once, so each score equals the term's IDF):

   barks   cat   dog  here  loud  sits
0   0.00  0.47  0.00  0.98  0.00  0.98
1   0.47  0.00  0.98  0.00  0.47  0.00
2   0.47  0.47  0.00  0.00  0.47  0.00


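BM25's advantage is that it handles document length and term-frequency saturation better than TF-IDF. It is not a true embedding method, though: it scores documents against terms rather than producing a continuous vector space representation. It is also sensitive to the values of k1 and b, which need careful tuning for best results.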
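Since BM25 is a scoring function, it is usually applied to a query rather than used to build a matrix. A minimal sketch of ranking the same three documents for a query follows. It assumes whitespace tokenization and the same IDF variant and parameter values as the example above; `bm25_score` and the query string are illustrative choices.

```python
import math

documents = ["cat sits here", "dog barks loud", "cat barks loud"]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)
avgdl = sum(len(tokens) for tokens in tokenized) / n_docs  # average document length


def idf(term):
    """BM25 inverse document frequency (the '+ 1' variant, always positive)."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)


def bm25_score(query, tokens, k1=1.5, b=0.75):
    """Sum the BM25 contribution of each query term for one document."""
    score = 0.0
    for term in query.split():
        tf = tokens.count(term)
        length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
        score += idf(term) * tf * (k1 + 1) / (tf + length_norm)
    return score


query = "cat barks"
scores = [bm25_score(query, tokens) for tokens in tokenized]
ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

print([round(score, 2) for score in scores])  # [0.47, 0.47, 0.94]
print(documents[ranking[0]])                  # cat barks loud
```

The third document matches both query terms and ranks first. The other two match one term each and tie.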

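(5) Word2Vec (CBOW and Skip-gram)

In 2013, Google released Word2Vec, a model that reshaped NLP. Word2Vec trains a shallow neural network to learn dense, low-dimensional word vectors that capture semantic and syntactic relationships between words. It comes in two architectures: Continuous Bag of Words (CBOW), which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from a word.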

from gensim.models import Word2Vec

# Example corpus
sentences = [
    ["I", "love", "deep", "learning"],
    ["Natural", "language", "processing", "is", "fun"],
    ["Word2Vec", "is", "a", "great", "tool"],
    ["AI", "is", "the", "future"],
]

# Train a CBOW model (sg=0)
cbow_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)

# Train a Skip-gram model (sg=1)
skipgram_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)

# Look up a word vector
word = "is"
print(f"CBOW Vector for '{word}':\n", cbow_model.wv[word])
print(f"\nSkip-gram Vector for '{word}':\n", skipgram_model.wv[word])

# Find the most similar words
print("\nCBOW Most Similar Words:", cbow_model.wv.most_similar(word))
print("\nSkip-gram Most Similar Words:", skipgram_model.wv.most_similar(word))

Output (format only; the numbers are placeholders. Actual values vary between runs, and a corpus this small yields no meaningful neighbors):

CBOW Vector for 'is':
 [0.123, 0.456, 0.789, ...]

Skip-gram Vector for 'is':
 [0.987, 0.654, 0.321, ...]

CBOW Most Similar Words: [('learning', 0.85), ('fun', 0.80)]
Skip-gram Most Similar Words: [('tool', 0.90), ('future', 0.88)]


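Word2Vec's strength is that it learns semantic relationships between words: famously, the vector for "king" minus "man" plus "woman" lands close to "queen". It also trains quickly on large corpora and produces dense vectors that are convenient for downstream processing.

Word2Vec has limits, too. Each word gets a single fixed embedding that does not change with context, so the model cannot tell apart the different senses of a polysemous word.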
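The two architectures differ mainly in how training examples are cut from a sentence. The sketch below is a simplification: it uses a fixed window, with no negative sampling or subsampling, and the function names are made up for illustration. It lists the (center, context) pairs Skip-gram would learn from and the (context, center) examples CBOW would learn from.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples


sentence = ["I", "love", "deep", "learning"]

pairs = skipgram_pairs(sentence)
print(len(pairs))   # 10
print(pairs[:2])    # [('I', 'love'), ('I', 'deep')]

examples = cbow_examples(sentence)
print(examples[1])  # (['I', 'deep', 'learning'], 'love')
```

Skip-gram produces more, finer-grained examples per sentence, while CBOW pools the whole context into a single prediction.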

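(6) GloVe (Global Vectors for Word Representation)

In 2014, Stanford introduced GloVe, which builds on the ideas behind Word2Vec by combining global co-occurrence statistics with local context information. GloVe first builds a matrix of how often word pairs co-occur across the whole corpus, then fits word vectors to the logarithm of those counts, which amounts to a weighted factorization of the matrix.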

import gensim.downloader as api

# Load a pretrained GloVe model
glove_model = api.load("glove-wiki-gigaword-50")  # "glove-twitter-25", "glove-wiki-gigaword-100", etc. also work

# Example word
word = "king"
print(f"Vector representation for '{word}':\n", glove_model[word])

# Find similar words
similar_words = glove_model.most_similar(word, topn=5)
print("\nWords similar to 'king':", similar_words)

# Compute the similarity between two words
word1 = "king"
word2 = "queen"
similarity = glove_model.similarity(word1, word2)
print(f"Similarity between '{word1}' and '{word2}': {similarity:.4f}")

Output (format only; the numbers are placeholders, not measured values):

Vector representation for 'king':
 [0.123, 0.456, 0.789, ...]

Words similar to 'king': [('queen', 0.85), ('prince', 0.80), ('monarch', 0.78), ...]

Similarity between 'king' and 'queen': 0.85


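GloVe's strength is that it uses statistics from the entire corpus to improve the representation, which usually yields more stable embeddings. Its drawbacks are that building and factorizing a large co-occurrence matrix takes substantial compute and memory, and that the resulting embeddings are still not context-dependent.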
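The first stage, counting co-occurrences, can be sketched in a few lines. This is an illustration under stated assumptions: a symmetric window, with each pair weighted by the inverse of the distance between the two words as the GloVe paper describes, and a made-up helper name `cooccurrence_counts`. The second stage, fitting vectors to the counts, is not shown.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric, distance-weighted co-occurrence counts."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` positions; add both directions
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer words contribute more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


sentences = [["cat", "barks", "loud"], ["dog", "barks", "loud"]]
counts = cooccurrence_counts(sentences)

print(counts[("barks", "loud")])  # 2.0 (adjacent in both sentences)
print(counts[("cat", "loud")])    # 0.5 (two positions apart, once)
```

On a real corpus this table has one entry per observed word pair, which is where the memory cost mentioned above comes from.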


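(7) FastText

In 2016, Facebook released FastText, which extends Word2Vec with subword (character n-gram) information. By breaking words into smaller units, the model copes better with rare words and morphologically rich languages.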

import gensim.downloader as api

# Load a pretrained FastText model
fasttext_model = api.load("fasttext-wiki-news-subwords-300")

# Example word
word = "king"
print(f"Vector representation for '{word}':\n", fasttext_model[word])

# Find similar words
similar_words = fasttext_model.most_similar(word, topn=5)
print("\nWords similar to 'king':", similar_words)

# Compute the similarity between two words
word1 = "king"
word2 = "queen"
similarity = fasttext_model.similarity(word1, word2)
print(f"Similarity between '{word1}' and '{word2}': {similarity:.4f}")

Output (format only; the numbers are placeholders, not measured values):

Vector representation for 'king':
 [0.123, 0.456, 0.789, ...]

Words similar to 'king': [('queen', 0.85), ('prince', 0.80), ('monarch', 0.78), ...]

Similarity between 'king' and 'queen': 0.85

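FastText's strength is that it handles rare words and morphologically rich languages and generalizes better across related word forms. The cost is higher computational complexity, and its embeddings still do not adapt to context.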
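The subword idea is easy to see in code. The sketch below extracts character n-grams the way the FastText paper describes: the word is wrapped in `<` and `>` boundary markers and every substring of length `min_n` to `max_n` is collected. The helper name and defaults are illustrative. A word's vector is then built from the vectors of its n-grams, plus one for the whole word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share many n-grams, which is how rare forms borrow signal
shared = set(char_ngrams("barking")) & set(char_ngrams("barks"))
print(sorted(shared))
# ['<ba', '<bar', '<bark', 'ark', 'bar', 'bark']
```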

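(8) Doc2Vec

Doc2Vec extends Word2Vec to larger pieces of text such as sentences, paragraphs, or whole documents. It turns variable-length text into fixed-length vectors, which makes document classification, clustering, and retrieval more effective.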

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk

# Tokenizer data; newer NLTK releases may require nltk.download('punkt_tab') instead
nltk.download('punkt')

# Example documents
documents = [
    "Machine learning is amazing",
    "Natural language processing enables AI to understand text",
    "Deep learning advances artificial intelligence",
    "Word embeddings improve NLP tasks",
    "Doc2Vec is an extension of Word2Vec"
]

# Tokenize each document and tag it with its index
tagged_data = [TaggedDocument(words=nltk.word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(documents)]

# Print the tagged data
print(tagged_data)

# Define the model parameters
model = Doc2Vec(vector_size=50, window=2, min_count=1, workers=4, epochs=100)

# Build the vocabulary
model.build_vocab(tagged_data)

# Train the model
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen document
test_doc = "Artificial intelligence uses machine learning"
test_vector = model.infer_vector(nltk.word_tokenize(test_doc.lower()))
print(f"Vector representation of test document:\n{test_vector}")

# Find the training documents most similar to the test document
similar_docs = model.dv.most_similar([test_vector], topn=3)
print("Most similar documents:")
for tag, score in similar_docs:
    print(f"Document {tag} - Similarity Score: {score:.4f}")

Output (abridged, format only; the scores are placeholders and vary between runs):

Most similar documents:
Document 0 - Similarity Score: 0.85
Document 1 - Similarity Score: 0.80
Document 2 - Similarity Score: 0.78

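Doc2Vec's strength is that it captures the topic and context of a document and applies to many tasks, such as recommendation, clustering, and summarization. Its drawbacks are that it needs a lot of data and careful tuning to produce high-quality document vectors, and that each document gets a single fixed vector that cannot reflect variation within the document.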

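III. Going Deeper: Deep-Learning Embedding Methods

(9) InferSent

In 2017, Facebook released InferSent, a method that produces high-quality sentence embeddings through supervised training on natural language inference (NLI) datasets. Its goal is to capture sentence-level semantic nuance, which makes it effective for tasks such as semantic similarity and textual entailment.

InferSent processes a sentence with a bidirectional LSTM, capturing context from both directions. Supervised training pulls semantically similar sentences closer together in the vector space, which improves performance on tasks such as sentiment analysis and paraphrase detection.

# InferSent takes more work to set up; the following Kaggle notebook is a suggested reference:
# https://www.kaggle.com/code/jeffd23/infer-sent-implementation

InferSent's strength is that it provides deep, context-rich sentence embeddings and performs very well on semantic inference tasks. Its drawbacks are that it needs large amounts of labeled data for training and that it is computationally demanding.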

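(10) Universal Sentence Encoder (USE)

In 2018, Google released the Universal Sentence Encoder (USE), a model for producing high-quality, general-purpose sentence embeddings. It aims to perform well across a range of NLP tasks without heavy fine-tuning. Sentences can be encoded either with a Transformer architecture or with a Deep Averaging Network (DAN).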

import tensorflow_hub as hub
import tensorflow as tf
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Load the model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
print("USE model loaded successfully!")

# Example sentences
sentences = [
    "Machine learning is fun.",
    "Artificial intelligence and machine learning are related.",
    "I love playing football.",
    "Deep learning is a subset of machine learning."
]

# Compute the sentence embeddings
embeddings = embed(sentences)
embeddings_np = embeddings.numpy()

# Print the embedding shape and the first sentence's embedding (truncated)
print(f"Embedding shape: {embeddings_np.shape}")
print(f"First sentence embedding (truncated):\n{embeddings_np[0][:10]} ...")

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(embeddings_np)
similarity_df = pd.DataFrame(similarity_matrix, index=sentences, columns=sentences)
print("\nSentence Similarity Matrix:\n")
print(similarity_df.round(2))

# Visualize the sentence embeddings (PCA projection to 2D)
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings_np)

plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], color='blue')
for i, sentence in enumerate(sentences):
    plt.annotate(f"Sentence {i+1}", (reduced[i, 0]+0.01, reduced[i, 1]+0.01))
plt.title("Sentence Embeddings (PCA projection)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.grid(True)
plt.show()

Output (format only; apart from the embedding shape, the numbers are placeholders):

Embedding shape: (4, 512)
First sentence embedding (truncated):
[0.123, 0.456, 0.789, ...] ...

Sentence Similarity Matrix:
Machine learning is fun.                                   1.00  0.85  0.20  0.30
Artificial intelligence and machine learning are related.  0.85  1.00  0.25  0.35
I love playing football.                                   0.20  0.25  1.00  0.15
Deep learning is a subset of machine learning.             0.30  0.35  0.15  1.00

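USE's strength is its generality and ease of use: it works across many applications without much task-specific tuning. Its drawbacks are that each sentence gets a fixed embedding that does not adapt to surrounding context, and that some variants are large enough to make deployment in resource-constrained environments difficult.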

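(11) Node2Vec

Node2Vec learns embeddings for the nodes of a graph. It is not a text representation method in itself, but it is increasingly used in NLP tasks that involve network or graph data, such as social networks and knowledge graphs.

Node2Vec runs biased random walks over the graph to generate node sequences, then learns low-dimensional node embeddings with a Word2Vec-style strategy. This captures both the local and the global structure of the graph.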

import networkx as nx
import numpy as np
from node2vec import Node2Vec
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Create a simple graph
G = nx.karate_club_graph()  # a well-known test graph with 34 nodes

# Visualize the original graph
plt.figure(figsize=(6, 6))
nx.draw(G, with_labels=True, node_color='skyblue', edge_color='gray', node_size=500)
plt.title("Original Karate Club Graph")
plt.show()

# Initialize the Node2Vec model
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=2)

# Train the model (uses Word2Vec under the hood)
model = node2vec.fit(window=10, min_count=1, batch_words=4)

# Get the embedding of one node
node_id = 0
vector = model.wv[str(node_id)]  # note: node IDs are stored as strings
print(f"Embedding for node {node_id}:\n{vector[:10]}...")  # truncated for display

# Get all embeddings
node_ids = model.wv.index_to_key
embeddings = np.array([model.wv[node] for node in node_ids])

# Reduce to 2D
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

# Visualize the embeddings
plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], color='orange')
for i, node in enumerate(node_ids):
    plt.annotate(node, (reduced[i, 0] + 0.05, reduced[i, 1] + 0.05))
plt.title("Node2Vec Embeddings (PCA Projection)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.grid(True)
plt.show()

# Find the nodes most similar to node 0
similar_nodes = model.wv.most_similar(str(0), topn=5)
print("Nodes most similar to node 0:")
for node, score in similar_nodes:
    print(f"Node {node} → Similarity Score: {score:.4f}")

Output (format only; the numbers are placeholders and vary between runs):

Embedding for node 0:
[0.123, 0.456, 0.789, ...]

Nodes most similar to node 0:
Node 1 → Similarity Score: 0.85
Node 2 → Similarity Score: 0.80
Node 3 → Similarity Score: 0.78

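Node2Vec's strength is that it captures rich relational information and applies to any graph-structured data. Its drawbacks are that it is a poor fit for text that has no graph structure, and that embedding quality is sensitive to the random-walk parameters.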
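The "biased" part of the random walk is what those parameters control. Here is a minimal sketch of a single walk, assuming an unweighted graph stored as an adjacency dictionary. The return parameter `p` and in-out parameter `q` follow the Node2Vec paper's definitions, while the function name and the toy graph are illustrative. A low `q` pushes the walk outward to explore, and a low `p` makes it backtrack more often.

```python
import random


def biased_walk(adjacency, start, walk_length, p=1.0, q=1.0, rng=None):
    """Generate one second-order random walk as a list of nodes."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = sorted(adjacency[current])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step has no previous node, so choose uniformly
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to where we came from
            elif candidate in adjacency[previous]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move farther away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy undirected graph (an assumption for this illustration)
adjacency = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
walk = biased_walk(adjacency, start=0, walk_length=8, p=1.0, q=0.5)
print(walk)
```

Many such walks, treated as sentences of node IDs, are what the Word2Vec stage is trained on.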

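(12) ELMo (Embeddings from Language Models)

In 2018, the Allen Institute for AI introduced ELMo, a breakthrough method that provides deep contextualized word representations. Unlike earlier models, ELMo produces a dynamic embedding for each word that changes with the sentence it appears in, capturing nuances of syntax and meaning.

ELMo processes text with a bidirectional LSTM, capturing the full context from both directions. It then combines the representations from the network's different layers, each of which captures a different aspect of language.

# ELMo takes more work to set up; the following article is a suggested reference:
# https://www.geeksforgeeks.org/elmo-embeddings-in-python/

ELMo's strength is that it adjusts word embeddings to their context and improves performance on many tasks, including sentiment analysis, question answering, and machine translation. Its drawbacks are high computational cost and relatively complex implementation and tuning.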
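The layer-combination step can be sketched independently of the LSTM itself. The example below assumes the per-layer outputs are already available as an array of shape (layers, tokens, hidden) and mixes them with softmax-normalized weights and a scale factor `gamma`, in the spirit of the weighting described in the ELMo paper. Random numbers stand in for real layer outputs, and the name `scalar_mix` is illustrative.

```python
import numpy as np


def scalar_mix(layer_outputs, layer_logits, gamma=1.0):
    """Combine per-layer token representations into one vector per token.

    layer_outputs: array of shape (num_layers, num_tokens, hidden_size)
    layer_logits:  array of shape (num_layers,), learned in a real model
    """
    weights = np.exp(layer_logits - np.max(layer_logits))
    weights = weights / weights.sum()  # softmax over layers
    # Weighted sum over the layer axis -> (num_tokens, hidden_size)
    return gamma * np.tensordot(weights, layer_outputs, axes=1)


rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(3, 5, 8))  # stand-in for 3 layers, 5 tokens

# Equal logits give equal weights, i.e. a plain average of the layers
mixed = scalar_mix(layer_outputs, layer_logits=np.zeros(3))
print(mixed.shape)  # (5, 8)

# A large logit on the last layer makes the mix follow that layer
top_heavy = scalar_mix(layer_outputs, layer_logits=np.array([0.0, 0.0, 50.0]))
```

Because the weights are learned per task, a downstream model can lean on whichever layers carry the information it needs.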

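(13) BERT and Its Variants

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based architecture that Google introduced in 2018, and it reshaped NLP by capturing bidirectional context. Because the model considers the context to the left and right of every word at the same time, it performs very well on tasks such as question answering, sentiment analysis, and named entity recognition.

BERT is built on the Transformer architecture, which uses self-attention to capture the dependencies between all words in a sentence at once. During pretraining, BERT randomly masks some words and predicts them from their context, which is how it learns bidirectional context. It is also trained on sentence pairs to predict whether one sentence logically follows another, which captures relationships between sentences.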

from transformers import AutoTokenizer, AutoModel
import torch

# Input sentence
sentence = "Natural Language Processing is transforming how machines understand humans."

# Select the device (use the GPU if one is available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()

# Tokenize
inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True).to(device)

# Forward pass to get the embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings
token_embeddings = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)

# Sentence embedding by mean pooling over tokens
# (fine for a single sentence; a padded batch should mask out padding tokens first)
sentence_embedding = torch.mean(token_embeddings, dim=1)
print(f"Sentence embedding from {model_name}:")
print(sentence_embedding)

Output (format only; the numbers are placeholders):

Sentence embedding from bert-base-uncased:
tensor([[ 0.1234,  0.5678, -0.9012,  ...,  0.3456, -0.7890,  0.1111]])

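BERT's strength is that it produces richer, more fine-grained word representations and can be applied to a wide range of downstream tasks with a small amount of fine-tuning. Its drawbacks are high computational cost and a large parameter count, which can make deployment in resource-constrained environments difficult.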
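The self-attention mechanism at the core of the Transformer can be written in a few lines of NumPy. This is a single-head sketch with random weights, purely to show the computation. Real BERT layers use multiple heads, learned projections, residual connections, and layer normalization, none of which appear here, and the dimensions below are arbitrary.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token representations
    Returns the attended outputs and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])              # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8  # arbitrary sizes for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # each row sums to 1
```

Row i of `weights` says how much token i draws on every token in the sentence, to its left and right alike, which is where the bidirectional context comes from.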

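(14) CLIP and BLIP

CLIP and BLIP are representative modern multimodal models. They combine text and visual data and provide strong support for tasks that involve both language and images, such as image search, image captioning, and visual question answering.

CLIP uses contrastive learning on a large dataset of image-text pairs to align each image embedding with the embedding of its matching text in a shared vector space. BLIP further improves the alignment between language and vision with a bootstrapped training approach.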

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import requests

# Select the device (use the GPU if one is available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and processor
clip_model_name = "openai/clip-vit-base-patch32"
clip_model = CLIPModel.from_pretrained(clip_model_name).to(device)
clip_processor = CLIPProcessor.from_pretrained(clip_model_name)

# Load an example image and text
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
text = "a cute puppy"

# Preprocess the inputs
inputs = clip_processor(text=[text], images=image, return_tensors="pt", padding=True).to(device)

# Compute the text and image embeddings
with torch.no_grad():
    text_embeddings = clip_model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_embeddings = clip_model.get_image_features(pixel_values=inputs["pixel_values"])

# Normalize the embeddings (optional)
text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)

print("Text Embedding Shape (CLIP):", text_embeddings.shape)
print("Image Embedding Shape (CLIP):", image_embeddings.shape)

Output:

Text Embedding Shape (CLIP): torch.Size([1, 512])
Image Embedding Shape (CLIP): torch.Size([1, 512])

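The strength of CLIP and BLIP is that they provide powerful cross-modal representations and perform very well on multimodal tasks. Their drawbacks are that training requires large amounts of paired data and extremely high computational resources.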
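The contrastive objective behind CLIP's alignment can be sketched with NumPy. The example assumes a batch in which row i of the image embeddings belongs with row i of the text embeddings. It computes a symmetric cross-entropy over the cosine-similarity matrix, as the CLIP paper describes. The temperature value and the toy embeddings are illustrative, and in the real model the temperature is learned.

```python
import numpy as np


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched image-text pairs."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities

    def cross_entropy_on_diagonal(matrix):
        # The correct "class" for row i is column i (its own pair)
        shifted = matrix - matrix.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (image_to_text + text_to_image)


# Toy embeddings: four pairs that are perfectly aligned...
aligned = np.eye(4)
loss_aligned = clip_contrastive_loss(aligned, aligned)

# ...versus the same texts paired with the wrong images
loss_mismatched = clip_contrastive_loss(aligned, np.roll(aligned, 1, axis=0))

print(loss_aligned < loss_mismatched)  # True
```

Minimizing this loss pulls matching image and text embeddings together and pushes every other pairing in the batch apart, which is why large batches of paired data matter so much.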

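IV. Side-by-Side Comparison

No.	Embedding method	Type	Architecture / approach	Common uses
1	Count Vectorizer	Context-independent, no machine learning	Count-based (bag of words)	Simple text classification (e.g. spam detection), baseline features
2	One-Hot Encoding	Context-independent, no machine learning	Manual encoding	Baseline models, rule-based systems
3	TF-IDF	Context-independent, no machine learning	Counts + inverse document frequency	Document ranking, text similarity, keyword extraction
4	Okapi BM25	Context-independent, statistical ranking	Probabilistic information retrieval model	Search engines, information retrieval
5	Word2Vec (CBOW, SG)	Context-independent, machine learning	Shallow neural network	Sentiment analysis, word similarity, NLP pipelines
6	GloVe	Context-independent, machine learning	Global co-occurrence matrix + machine learning	Word similarity, embedding initialization
7	FastText	Context-independent, machine learning	Word2Vec + subword embeddings	Morphologically rich languages, out-of-vocabulary words
8	Doc2Vec	Context-independent, machine learning	Document-level extension of Word2Vec	Document classification, clustering
9	InferSent	Context-dependent, RNN-based	Bidirectional LSTM with supervised training	Semantic similarity, natural language inference
10	Universal Sentence Encoder	Context-dependent, Transformer-based	Transformer / Deep Averaging Network (DAN)	Sentence embeddings for search, chatbots, semantic similarity
11	Node2Vec	Graph-based embedding	Random walks + Skip-gram	Graph representation, recommender systems, link prediction
12	ELMo	Context-dependent, RNN-based	Bidirectional LSTM	Named entity recognition, question answering, coreference resolution
13	BERT and its variants	Context-dependent, Transformer-based	Transformer encoder (masked-word and next-sentence pretraining)	Question answering, sentiment analysis, summarization, semantic search
14	CLIP	Multimodal, Transformer-based	Vision + text encoders (contrastive learning)	Image captioning, cross-modal search, text-to-image retrieval
15	BLIP	Multimodal, Transformer-based	Vision-language pretraining (VLP)	Image captioning, visual question answering (VQA)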

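V. Summary and Outlook

From basic counting to today's multimodal deep learning models, the history of embeddings is full of innovation and breakthroughs. Each method has its own strengths and limits, and each has a role in particular tasks and settings.

In practice, choosing the right embedding technique matters. If you are building a simple chatbot, a count vectorizer or TF-IDF may be enough. If you need to handle complex semantic tasks, BERT or one of its variants is probably the better choice. And if you work with multimodal data, CLIP and BLIP are essential tools.

Embedding techniques keep evolving along with the rest of the field. We can expect more efficient and more capable embedding models that understand and process human language better and open up new possibilities for NLP.

In this era of challenge and opportunity, let's keep exploring what embeddings can do and use the power of vectors to light the way for language intelligence.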

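Reposted from the WeChat official account Halo咯咯. Author: 基咯咯

Original link: https://mp.weixin.qq.com/s/3PQcxLgkri4zYqDtA8YxIg

© Copyright belongs to the author. Reposts must credit the source; otherwise legal action may be taken.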
