Five Python Operations That Make Text Analysis Easy

Author: 用戶007, 2025-06-10 08:25:00

This article walks through the five operations most commonly used for text analysis in Python, to help you build the core skills of the field.

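With its rich library ecosystem and concise syntax, Python has become a popular choice for text analysis. Whether you are processing large volumes of text, doing natural language processing (NLP), or extracting useful insights, Python offers efficient tools for the job.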

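1. Text Preprocessing and Cleaning

Text preprocessing is the foundation of text analysis. Its goal is to clean and normalize the raw text so that it is better suited to later processing.

Common operations:

Removing punctuation: use the string library or regular expressions.
Converting to lowercase: gives the text a uniform format.
Removing stop words: use the nltk library to drop common words that carry little meaning.
Stemming and lemmatization: use the nltk or spaCy library.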

import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the NLTK data used below
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text (kept unchanged so it can be printed at the end)
raw_text = "Hello, world! This is a sample text with some punctuation and stop words."

# Remove punctuation
no_punct_text = raw_text.translate(str.maketrans('', '', string.punctuation))

# Convert to lowercase
text = no_punct_text.lower()

# Remove stop words
stop_words = set(stopwords.words('english'))
words = text.split()
filtered_text = ' '.join([word for word in words if word not in stop_words])

# Stemming
stemmer = PorterStemmer()
stemmed_text = ' '.join([stemmer.stem(word) for word in filtered_text.split()])

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in filtered_text.split()])

print("Original text:", raw_text)
print("After removing punctuation and lowercasing:", text)
print("After removing stop words:", filtered_text)
print("After stemming:", stemmed_text)
print("After lemmatization:", lemmatized_text)

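Key points:

Removing punctuation and lowercasing: keeps the text consistent.
Removing stop words: reduces noise and speeds up processing.
Stemming and lemmatization: normalize word forms for later analysis.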
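The operations list mentions regular expressions as an alternative to the string library for removing punctuation. A minimal sketch of that variant follows; the helper name clean_text and the sample sentence are illustrative assumptions, not part of the original example.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text, strip punctuation, and collapse whitespace."""
    # Replace every character that is neither a word character nor whitespace
    # with a space (note that \w also matches digits and the underscore).
    no_punct = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", no_punct).strip().lower()

cleaned = clean_text("Hello, world! It's a well-known   example.")
print(cleaned)
```

Replacing punctuation with a space, rather than deleting it, keeps "well-known" as two tokens instead of fusing it into "wellknown". Which behavior is preferable depends on the downstream task.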

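2. Word Frequency Counting

Counting word frequencies is one of the basic operations of text analysis. It shows how words are distributed in a text.

Common operations:

Simple frequency counting: use collections.Counter.
Drawing a word cloud: use the wordcloud library to visualize the frequencies.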

from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

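# Sample text
text = "This is a sample text. This text is used to demonstrate word frequency analysis."

# Word frequency count: lowercase each token and strip surrounding punctuation
# so that "This"/"this" and "text."/"text" are counted together
tokens = [word.strip('.,!?;:').lower() for word in text.split()]
word_counts = Counter(tokens)

# Draw the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_counts)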

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

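Key points:

Word frequency counting: a quick way to see how words are distributed in a text.
Word cloud: shows high-frequency words at a glance, which makes the character of the text easier to spot.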
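Before drawing a word cloud it is often useful to look at the raw counts. The sketch below combines Counter.most_common with a stop-word filter. The STOP_WORDS set is a deliberately tiny, hypothetical list for illustration; the NLTK list from section 1 could be swapped in.

```python
from collections import Counter

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"this", "is", "a", "to"}

def top_words(text, n=3):
    """Return the n most frequent non-stop-word tokens with their counts."""
    # Lowercase each token and strip surrounding punctuation.
    tokens = [w.strip(".,!?;:").lower() for w in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOP_WORDS)
    return counts.most_common(n)

text = "This is a sample text. This text is used to demonstrate word frequency analysis."
top = top_words(text)
print(top)  # the first entry is ('text', 2); every other word occurs once
```

Without the stop-word filter, "this" and "is" would tie with "text" at the top of the list, which is why stop-word removal usually comes before counting.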

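3. Sentiment Analysis

Sentiment analysis is an important NLP task. It determines whether a text expresses positive, negative, or neutral sentiment.

Common operations:

Rule-based sentiment analysis: score the text against a predefined sentiment lexicon. TextBlob's default analyzer and VADER both work this way.
Machine-learning-based sentiment analysis: train a classifier on labeled examples, for instance with TextBlob's optional NaiveBayesAnalyzer.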

from textblob import TextBlob

# Sample text
text = "I love this product! It's amazing and very useful."

# Run sentiment analysis with TextBlob
blob = TextBlob(text)
sentiment = blob.sentiment

print(f"Polarity: {sentiment.polarity}")
print(f"Subjectivity: {sentiment.subjectivity}")

if sentiment.polarity > 0:
    print("Sentiment: positive")
elif sentiment.polarity < 0:
    print("Sentiment: negative")
else:
    print("Sentiment: neutral")

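Key points:

Polarity: a value between -1 and 1. Negative values indicate negative sentiment, positive values indicate positive sentiment.
Subjectivity: a value between 0 and 1. The closer to 1, the more subjective the text.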
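To make the idea of lexicon-based scoring concrete, here is a toy sketch in plain Python. It is not how TextBlob or VADER are implemented; the LEXICON scores, the NEGATIONS set, and the averaging rule are all assumptions chosen for illustration.

```python
# Hypothetical mini-lexicon: word -> polarity score in [-1, 1].
LEXICON = {"love": 0.8, "amazing": 0.9, "useful": 0.5,
           "hate": -0.8, "terrible": -0.9}
NEGATIONS = {"not", "never", "no"}

def lexicon_polarity(text):
    """Average the scores of lexicon words, flipping the sign right after a negation."""
    tokens = [w.strip(".,!?;:").lower() for w in text.split()]
    scores = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            score = LEXICON[token]
            scores.append(-score if negate else score)
        negate = False  # a negation only affects the token that follows it
    # Texts with no lexicon words are treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0

polarity = lexicon_polarity("I love this product! It's amazing and very useful.")
print(round(polarity, 3))
```

Real lexicons are far larger and handle intensifiers, punctuation, and longer-range negation, so this sketch should be read as an explanation of the principle rather than a usable analyzer.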

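4. Text Classification

Text classification assigns texts to predefined categories. It is widely used for tasks such as spam filtering and sentiment analysis.

Common operations:

Feature extraction: use TF-IDF or a bag-of-words model.
Training a machine learning model: use the scikit-learn library.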

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

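# Sample data
texts = ["I love this product", "This is a great movie", "I hate this book", "This is a terrible experience"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Split into training and test sets first, so the vectorizer never sees the test texts
texts_train, texts_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Feature extraction: fit TF-IDF on the training texts only, then transform the test texts
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate (with four samples the test set holds a single text, so the score is only illustrative)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")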

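Key points:

Feature extraction: converts text into numerical feature vectors.
Machine learning model: a model is trained to perform the classification.
Evaluation: metrics such as accuracy measure how well the model performs.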
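To show what "converting text into numerical features" means, the sketch below computes textbook TF-IDF weights by hand, as term frequency times ln(N / df). This is an assumption-laden illustration, not scikit-learn's implementation: TfidfVectorizer by default smooths the IDF term, L2-normalizes each vector, and ignores single-character tokens such as "I", so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = ["I love this product", "I hate this book"]
weights = tf_idf(docs)
print(weights[0])
```

Terms that occur in every document ("i" and "this" here) get weight 0, while terms unique to one document ("love", "hate") get the highest weights. That is the property that makes TF-IDF useful for telling documents apart.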

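5. Topic Modeling

Topic modeling is an unsupervised learning method for discovering the latent topics in a collection of texts.

Common operations:

LDA (Latent Dirichlet Allocation): use the Gensim library to build the topic model.
Visualization: use the pyLDAvis library to visualize the results.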

from gensim import corpora, models
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

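# Sample data (already tokenized)
texts = [["this", "is", "a", "sample", "text"],
         ["another", "example", "of", "text", "data"],
         ["more", "text", "to", "demonstrate", "topic", "modeling"]]

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Visualize (display() renders inside a Jupyter notebook;
# in a plain script, pyLDAvis.save_html(vis, 'lda.html') writes a standalone page instead)
vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)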

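Key points:

LDA: discovers the latent topics in a collection of texts.
Visualization: shows the topics and their associated words at a glance.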
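The dictionary and corpus step is the least obvious part of the LDA example, so here is a plain-Python sketch of the bag-of-words idea behind it. It is not Gensim's implementation: the helper names build_vocabulary and to_bow are hypothetical, and ids are assigned in first-seen order here, which may differ from the ids corpora.Dictionary assigns.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id, in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            # setdefault only assigns a new id when the token is not yet known.
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

texts = [["this", "is", "a", "sample", "text"],
         ["another", "example", "of", "text", "data"]]
token2id = build_vocabulary(texts)
corpus = [to_bow(doc, token2id) for doc in texts]
print(corpus[1])
```

Each document becomes a list of (id, count) pairs with word order discarded. This is the representation LDA consumes, and it explains why LDA models topics as distributions over words rather than over sentences.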

Editor: 趙寧寧. Source: Python數智工坊

Tags: Python, text analysis, data analysis
