How to Clean Text Data with Python
Not all data comes in tabular form. As we enter the era of big data, data formats have become highly diverse, including images, text, graphs, and more.
Because formats vary so much from one dataset to another, preprocessing the data into a machine-readable form is essential.
In this article, I will show how to preprocess text data with Python, using the NLTK library and Python's built-in `re` module.

The Process
1. Lowercase the text
Before we start processing the text, it is best to lowercase all characters. We do this to avoid case-sensitivity problems later on.
Suppose we want to remove stop words from a string; the usual approach is to join the non-stop words back into one sentence. Without lowercasing, capitalized stop words such as "The" would not match the lowercase stop-word list, so they would survive and the string would come out essentially unchanged. That is why lowercasing the text matters.
This is easy to do in Python. The code looks like this:
# Example
x = "Watch This Airport Get Swallowed Up By A Sandstorm In Under A Minute http://t.co/TvYQczGJdy"
# Lowercase the text
x = x.lower()
print(x)
>>> watch this airport get swallowed up by a sandstorm in under a minute http://t.co/tvyqczgjdy
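To see the case-sensitivity problem concretely, here is a minimal sketch; the stop-word set is a small hand-picked subset (an assumption for illustration), not NLTK's full list:

```python
# Tiny hand-picked stop-word subset (an assumption), standing in for
# NLTK's full English stop-word list.
stop_words = {"the", "a", "in", "by", "up", "this"}

text = "Watch This Airport Get Swallowed Up By A Sandstorm"

# Without lowercasing, "This", "Up", "By" and "A" do not match the
# lowercase stop-word list, so every word survives:
kept_raw = [w for w in text.split() if w not in stop_words]

# After lowercasing, the same words are detected and removed:
kept_lower = [w for w in text.lower().split() if w not in stop_words]

print(kept_raw)
print(kept_lower)  # ['watch', 'airport', 'get', 'swallowed', 'sandstorm']
```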
2. Remove Unicode characters
Some articles may contain Unicode characters that become unreadable when viewed in ASCII. Most of the time these are emoji and other non-ASCII symbols. To remove them, we can use code like this:
# Example
x = "Reddit Will Now QuarantineÛ_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP"
# Remove unicode characters
x = x.encode('ascii', 'ignore').decode()
print(x)
>>> Reddit Will Now Quarantine_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP
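Note that `encode('ascii', 'ignore')` silently drops every non-ASCII character, accented letters included, not just emoji. If that is too aggressive, a gentler option is to decompose accents first with the standard `unicodedata` module; a small sketch:

```python
import unicodedata

x = "I love Python \U0001F40D caf\u00e9"

# encode/decode drops the emoji and the accented "é" entirely
stripped = x.encode("ascii", "ignore").decode()
print(stripped)  # I love Python  caf

# NFKD decomposition splits "é" into "e" + a combining accent, so only
# the accent mark is dropped and the base letter survives
y = "caf\u00e9 r\u00e9sum\u00e9"
ascii_only = unicodedata.normalize("NFKD", y).encode("ascii", "ignore").decode()
print(ascii_only)  # cafe resume
```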
3. Remove stop words
Stop words are words that do not contribute significant meaning to a text, so we can remove them. To get a stop-word list, we can download one from the NLTK library. Here is the code:
import nltk
from nltk.corpus import stopwords
# Download the stop-word corpus (or nltk.download() to fetch everything)
nltk.download('stopwords')
stop_words = stopwords.words("english")
# Example
x = "America like South Africa is a traumatised sick country - in different ways of course - but still messed up."
# Remove stop words
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa traumatised sick country - different ways course - still messed up.
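One caveat worth knowing: stop-word matching is an exact string comparison, so tokens with punctuation still attached slip through. A minimal sketch, using a tiny assumed stop-word subset rather than NLTK's full list:

```python
# Tiny assumed stop-word subset for illustration only
stop_words = {"is", "a", "in", "of", "but"}

text = "America is a country, but still messed up."
kept = ' '.join(w for w in text.split(' ') if w not in stop_words)

# "country," and "up." keep their punctuation, so a stop word with a
# comma glued on would not be removed either - hence the punctuation
# removal in the next step.
print(kept)  # America country, still messed up.
```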
4. Remove mentions, hashtags, links, and similar terms
Besides Unicode characters and stop words, there are several other kinds of terms we need to remove, including mentions, hashtags, links, and punctuation.
Removing these is hard if we rely only on fixed, predefined characters, so instead we match the patterns we want with regular expressions (regex).
A regex is a special string that defines a pattern; words matching that pattern can then be searched for or removed using Python's built-in `re` module. Here is the code:
import re
import string
# Remove mentions
x = "@DDNewsLive @NitishKumar and @ArvindKejriwal can't survive without referring @@narendramodi . Without Mr Modi they are BIG ZEROS"
x = re.sub(r"@\S+", " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
# Remove URL links
x = "Severe Thunderstorm pictures from across the Mid-South http://t.co/UZWLgJQzNS"
x = re.sub(r"https*\S+", " ", x)
print(x)
>>> Severe Thunderstorm pictures from across the Mid-South
# Remove hashtags
x = "Are people not concerned that after #SLAB's obliteration in Scotland #Labour UK is ripping itself apart over #Labourleadership contest?"
x = re.sub(r"#\S+", " ", x)
print(x)
>>> Are people not concerned that after obliteration in Scotland UK is ripping itself apart over contest?
# Remove apostrophes and the word characters that follow them
x = "Notley's tactful yet very direct response to Harper's attack on Alberta's gov't. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli"
x = re.sub(r"\'\w+", "", x)
print(x)
>>> Notley tactful yet very direct response to Harper attack on Alberta gov. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli
# Remove punctuation (each character is replaced by a space)
x = "In 2014 I will only smoke crqck if I becyme a mayor. This includes Foursquare."
x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
print(x)
>>> In 2014 I will only smoke crqck if I becyme a mayor  This includes Foursquare
# Remove numbers, including any word that contains a digit
x = "C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980... http://t.co/tNI92fea3u http://t.co/czBaMzq3gL"
x = re.sub(r'\w*\d+\w*', '', x)
print(x)
>>> C- specially modified to land in a stadium and rescue hostages in Iran in ... http://t.co/ http://t.co/
# Collapse repeated whitespace into a single space
x = " and can't survive without referring . Without Mr Modi they are BIG ZEROS"
x = re.sub(r'\s{2,}', " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
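When the same substitutions run over many rows, it can help to compile the patterns once and apply them in order. The pattern list and `clean` helper below are my own sketch, not from the article; note that the whitespace collapse must come last, after the other removals have left gaps behind:

```python
import re

# Precompile the patterns once; apply them in order to each string.
# The combined [@#]\S+ class handles mentions and hashtags together.
patterns = [
    (re.compile(r"https*\S+"), " "),  # URLs
    (re.compile(r"[@#]\S+"), " "),    # mentions and hashtags
    (re.compile(r"\s{2,}"), " "),     # collapse repeated whitespace last
]

def clean(text):
    for pattern, repl in patterns:
        text = pattern.sub(repl, text)
    return text.strip()

print(clean("Fire near La Ronge http://t.co/abc #wildfire @user"))
# Fire near La Ronge
```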
5. Combining everything into one function
Now that we have walked through each preprocessing step, let's apply them to a list of texts. If you look closely at the steps, you will see that they are all interrelated, so we should wrap them in a single function that applies them in sequence. Here are the sample texts before preprocessing:
- Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
- Forest fire near La Ronge Sask. Canada
- All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
- 13,000 people receive #wildfires evacuation orders in California
- Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
To preprocess a list of texts, we take two steps:
- Create a function that contains all the preprocessing steps and returns a preprocessed string
- Apply the function to every row using the pandas method called "apply"
The code looks like this:
# In case the imports fail:
# ! pip install nltk
# ! pip install textblob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
import string
from nltk.corpus import stopwords
# Download the stop-word corpus if it is missing
# (or nltk.download() to fetch all of NLTK)
nltk.download('stopwords')
df = pd.read_csv('train.csv')
stop_words = stopwords.words("english")
def text_preproc(x):
    x = x.lower()
    x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)
    x = re.sub(r'@\S+', ' ', x)
    x = re.sub(r'#\S+', ' ', x)
    x = re.sub(r'\'\w+', '', x)
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x
df['clean_text'] = df.text.apply(text_preproc)
The preprocessing result for the texts above looks like this:
- deeds reason may allah forgive us
- forest fire near la ronge sask canada
- residents asked place notified officers evacuation shelter place orders expected
- people receive evacuation orders california
- got sent photo ruby smoke pours school
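For readers who want to verify the pipeline without pandas or the NLTK download, here is a self-contained sketch of the same function. The stop-word set is a tiny hand-picked subset (an assumption standing in for NLTK's full list), and a final `strip()` is added to trim leftover edge whitespace:

```python
import re
import string

# Tiny assumed stop-word subset standing in for NLTK's English list
stop_words = {"our", "are", "the", "of", "this", "all"}

def text_preproc(x):
    x = x.lower()
    x = ' '.join(w for w in x.split(' ') if w not in stop_words)
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)   # URLs
    x = re.sub(r'@\S+', ' ', x)        # mentions
    x = re.sub(r'#\S+', ' ', x)        # hashtags
    x = re.sub(r'\'\w+', '', x)        # apostrophe suffixes
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)    # words containing digits
    x = re.sub(r'\s{2,}', ' ', x)      # repeated whitespace
    return x.strip()

print(text_preproc("Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all"))
# deeds reason may allah forgive us
```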
Final thoughts
Those are the concrete steps for preprocessing text data with Python. I hope they help you solve problems involving text data, making the data more consistent and, in turn, your models more accurate.