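Five Rarely Mentioned Python Libraries That Boost NLP Productivity

Author: deephub 2021-12-27 16:09:54

This article shares five excellent but rarely mentioned Python libraries that can help with a wide range of natural language processing (NLP) tasks.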

Contractions

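Contractions expands common English contractions and slang. It is fast and efficient, and it handles most edge cases, such as missing apostrophes.

For example, expanding contractions in text data (don’t → do not; can’t → cannot; haven’t → have not) used to require writing a long list of regular expressions. Contractions takes care of this for you.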

pip install contractions

Usage example

import contractions 
s = "ive gotta go! i'll see yall later." 
text = contractions.fix(s, slang=True) 
print(text)

Result

ORIGINAL: ive gotta go! i’ll see yall later. 
OUTPUT: I have got to go! I will see you all later.

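An important part of text preprocessing is creating consistency and shrinking the vocabulary without losing too much meaning. Bag-of-words and TF-IDF models create large sparse matrices in which every variable is a distinct vocabulary word in the corpus. Expanding contractions reduces dimensionality further and also helps with filtering stop words.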
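To illustrate the dimensionality point, here is a minimal sketch that counts the bag-of-words vocabulary before and after expansion. It uses a tiny hand-written mapping, EXPANSIONS, as a hypothetical stand-in for contractions.fix so that the snippet runs without extra dependencies; the real library covers far more cases.

```python
import re

# Hypothetical stand-in for contractions.fix: a tiny hand-written mapping.
EXPANSIONS = {
    "don't": "do not",
    "can't": "cannot",
    "haven't": "have not",
    "i'll": "i will",
}

# One alternation over all known contractions, matched on word boundaries.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in EXPANSIONS) + r")\b",
    re.IGNORECASE,
)

def expand(text):
    """Replace each known contraction with its expanded form."""
    return PATTERN.sub(lambda m: EXPANSIONS[m.group(0).lower()], text)

def vocabulary(docs):
    """Return the set of distinct lowercase tokens across all documents."""
    return {tok for doc in docs for tok in re.findall(r"[a-z']+", doc.lower())}

docs = [
    "I don't like it",
    "I do not like it",
    "they haven't left",
    "they have not left",
]
vocab_before = vocabulary(docs)
vocab_after = vocabulary([expand(d) for d in docs])
print(len(vocab_before), len(vocab_after))
```

Each distinct token becomes one column in a bag-of-words or TF-IDF matrix, so merging "don't" into "do" and "not" removes a column without discarding information.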

Distilbert-Punctuator

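Take text with missing punctuation, split it into sentences, and add the punctuation back... sounds easy, right? For a computer, it is considerably more complicated.

Distilbert-punctuator is the only Python library I could find that performs this task, and it is very accurate. That is because it uses a distilled variant of BERT. The model was further fine-tuned on a combination of more than 20,000 news articles and 4,000 TED Talk transcripts to detect sentence boundaries. When it inserts sentence-ending punctuation (such as a period), the model also capitalizes the first letter of the next sentence.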
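The post-processing step described above, attaching punctuation and capitalizing the word that follows a sentence end, can be sketched as follows. This is only an illustration of the idea: the tag names in TAG_TO_PUNCT and the restore function are hypothetical, and the tags are supplied by hand rather than predicted by a model.

```python
# Hypothetical tag names; the library's real tag set may differ.
TAG_TO_PUNCT = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}
SENTENCE_END = {".", "?", "!"}

def restore(tokens, tags):
    """Attach the punctuation for each tag and capitalize sentence starts."""
    out = []
    capitalize_next = True  # the first token always starts a sentence
    for token, tag in zip(tokens, tags):
        if capitalize_next:
            token = token[:1].upper() + token[1:]
        punct = TAG_TO_PUNCT[tag]
        out.append(token + punct)
        # Only sentence-ending marks trigger capitalization of the next token.
        capitalize_next = punct in SENTENCE_END
    return " ".join(out)

tokens = "will you support me yes I will".split()
tags = ["O", "O", "O", "QUESTION", "COMMA", "O", "PERIOD"]
print(restore(tokens, tags))
```

In the real library the per-token tags would come from the fine-tuned model; here they are hard-coded so the mechanics are visible.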

Installation

pip install distilbert-punctuator

This library pulls in quite a few dependencies. If you only want to try it out, you can do so on Google Colab.

Usage example

from dbpunctuator.inference import Inference, InferenceArguments 
from dbpunctuator.utils import DEFAULT_ENGLISH_TAG_PUNCTUATOR_MAP 
args = InferenceArguments( 
        model_name_or_path="Qishuai/distilbert_punctuator_en", 
        tokenizer_name="Qishuai/distilbert_punctuator_en", 
        tag2punctuator=DEFAULT_ENGLISH_TAG_PUNCTUATOR_MAP 
    ) 
punctuator_model = Inference(inference_args=args,  
                             verbose=False) 
text = [ 
""" 
however when I am elected I vow to protect our American workforce 
unlike my opponent I have faith in our perseverance our sense of trust and our democratic principles will you support me 
""" 
] 
 
print(punctuator_model.punctuation(text)[0])

Result

ORIGINAL:  
however when I am elected I vow to protect our American workforce 
unlike my opponent I have faith in our perseverance our sense of trust and our democratic principles will you support me 
 
OUTPUT: 
However, when I am elected, I vow to protect our American workforce. Unlike my opponent, I have faith in our perseverance, our sense of trust and our democratic principles. Will you support me?

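Use this library when you simply want text data to be more grammatically correct and easier to present. Whether the task is cleaning up messy Twitter posts or chatbot messages, this library is a good fit.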

Textstat

Textstat is an easy-to-use, lightweight library that provides a variety of metrics about text data, such as reading level, reading time, and word count.

pip install textstat

Usage example

import textstat 
text = """ 
Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  
""" 
# Flesch reading ease score 
print(textstat.flesch_reading_ease(text)) 
  # 90-100 | Very Easy 
  # 80-89  | Easy 
  # 70-79  | Fairly Easy 
  # 60-69  | Standard 
  # 50-59  | Fairly Difficult 
  # 30-49  | Difficult 
  # <30    | Very Confusing 
 
# Reading time (output in seconds) 
# Assuming 70 milliseconds/character 
 
print(textstat.reading_time(text, ms_per_char=70)) 
 
# Word count 
print(textstat.lexicon_count(text, removepunct=True))

Result

ORIGINAL: 
Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. 
 
OUTPUTS: 
74.87 # reading score is considered 'Fairly Easy' 
7.98  # 7.98 seconds to read 
30    # 30 words

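These metrics also add an extra layer of analysis. For example, take a dataset of celebrity news articles from a gossip magazine. Using textstat, you may find that articles that are quicker and easier to read are more popular and have higher retention.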
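For readers curious where these numbers come from, here is a minimal sketch of the standard Flesch reading-ease formula and of a per-character reading-time estimate. The syllable counter is a rough vowel-group heuristic and the function names are my own, so the results will not match textstat exactly; the reading-time function assumes whitespace is not counted.

```python
import re

def count_syllables(word):
    """Rough estimate: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * len(words) / len(sentences)
        - 84.6 * syllables / len(words)
    )

def reading_time_seconds(text, ms_per_char=70):
    """Reading time assuming a fixed cost per non-whitespace character."""
    n_chars = sum(len(w) for w in text.split())
    return round(n_chars * ms_per_char / 1000, 2)

sample = "The cat sat. The dog ran."
print(round(flesch_reading_ease(sample), 2))
print(reading_time_seconds(sample))
```

Short sentences and short words both push the score up, which is why informal review text tends to land in the "Easy" bands.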

Gibberish-Detector

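The main purpose of this low-code library is to detect unintelligible words (gibberish). It uses a model trained on a large corpus of English words.

pip install gibberish-detector

After installation you need to train the model yourself, but this is simple and takes only about a minute. The training steps are:

Download the training corpus named big.txt
Open your CLI and cd into the directory that contains big.txt
Run the following command: gibberish-detector train .\big.txt > gibberish-detector.model

This creates a file named gibberish-detector.model in the current directory.

Usage example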

from gibberish_detector import detector 
# load the gibberish detection model 
# a plain relative path avoids the invalid "\g" escape in a normal string 
Detector = detector.create_from_model('gibberish-detector.model') 
 
text1 = "xdnfklskasqd" 
print(Detector.is_gibberish(text1)) 
 
text2 = "apples" 
print(Detector.is_gibberish(text2))

Result

True  # xdnfklskasqd (this is gibberish) 
False # apples (this is not)

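It helps me remove bad observations from a dataset. It can also be used for error handling on user input. For example, if a user enters meaningless gibberish in your web application, you can return an error message.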
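One common way to build such a detector is a character-level bigram model: learn how often each letter follows another in real English, then score new strings by how plausible their letter transitions are. The sketch below illustrates that general idea with a toy corpus; it is an assumption about the approach, not a description of the library's internals, and all names in it are my own.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def normalize(text):
    """Lowercase and keep only letters and spaces."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)

def train(corpus):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    text = normalize(corpus)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def avg_log_prob(counts, word):
    """Average log-probability of the transitions, with add-one smoothing."""
    text = normalize(word)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        row = counts.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + len(ALPHABET)))
    return total / len(pairs)

# Toy corpus for illustration; a real model would train on something like big.txt.
TOY_CORPUS = (
    "the quick brown fox jumps over the lazy dog apples and oranges are "
    "simple pleasures people like to share with friends in the evening"
)
model = train(TOY_CORPUS)
good = avg_log_prob(model, "apples")
bad = avg_log_prob(model, "xdnfklskasqd")
print(good > bad)
```

A practical detector would then pick a threshold between the scores of known good and known bad examples and flag anything below it.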

NLPAug

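Saving the best for last.

First, what is data augmentation? It is any technique that expands the size of a training set by adding slightly modified copies of existing data. Data augmentation is typically used when the existing data has limited diversity or is imbalanced. For computer vision problems, augmentation creates new samples by cropping, rotating, and changing the brightness of images. For numerical data, synthetic instances can be created with clustering techniques.

But what if we are working with text data? This is where NLPAug comes in. The library augments text by substituting or inserting semantically related words. It performs the augmentation with a pretrained language model such as BERT, which is a powerful approach because it takes the context of each word into account. Depending on the parameters you set, the top n most similar words are used to modify the text.

Pretrained word embeddings, such as Word2Vec and GloVe, can also be used to replace words with synonyms.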
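To make the "top n similar words" idea concrete, here is a minimal sketch of the substitution mechanics. The CANDIDATES table is a hypothetical stand-in for the ranked suggestions a language model or embedding lookup would produce, and the way the number of replaced words is derived from aug_p is an assumption for illustration, not NLPAug's exact rule.

```python
import math
import random

# Hypothetical ranked suggestions; a real augmenter would get these from a model.
CANDIDATES = {
    "buy": ["purchase", "get", "grab"],
    "food": ["groceries", "lunch", "coffee"],
    "town": ["city", "downtown", "denver"],
}

def substitute(text, aug_p=0.4, top_k=2, seed=None):
    """Replace up to ceil(aug_p * n_words) words with one of their top_k candidates."""
    rng = random.Random(seed)
    words = text.split()
    # Only words that have candidate replacements can be augmented.
    replaceable = [i for i, w in enumerate(words) if w.lower() in CANDIDATES]
    n_aug = min(len(replaceable), max(1, math.ceil(aug_p * len(words))))
    for i in rng.sample(replaceable, n_aug):
        # Draw at random from the k best-ranked candidates for this word.
        words[i] = rng.choice(CANDIDATES[words[i].lower()][:top_k])
    return " ".join(words)

print(substitute("Come into town with me today to buy food", top_k=2, seed=0))
```

A larger top_k yields more varied output at the cost of drifting further from the original meaning, while aug_p controls how much of each sentence changes.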

pip install nlpaug

Usage example

import nlpaug.augmenter.word as naw 
 
# main parameters to adjust 
ACTION = 'substitute' # or use 'insert' 
TOP_K = 15 # randomly draw from top 15 suggested words 
AUG_P = 0.40 # augment 40% of words within text 
 
aug_bert = naw.ContextualWordEmbsAug( 
    model_path='bert-base-uncased',  
    action=ACTION,  
    top_k=TOP_K, 
    aug_p=AUG_P 
    ) 
 
text = """ 
Come into town with me today to buy food! 
""" 
augmented_text = aug_bert.augment(text, n=3) # n: num. of outputs 
print(augmented_text)

Result

ORIGINAL: 
Come into town with me today to buy food! 
 
OUTPUTS: 
• drove into denver with me today to purchase groceries! 
• head off town with dad today to buy coffee! 
• come up shop with mom today to buy lunch!

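Suppose you are training a supervised classification model on a dataset with 15k positive reviews and only 4k negative reviews. A heavily imbalanced dataset biases the model toward the majority class (positive reviews) during training.

Simply duplicating examples of the minority class (negative reviews) adds no new information to the model. Instead, use NLPAug's advanced text augmentation features to add diversity to the minority class. This technique has been shown to improve AUC and F1-score.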
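The balancing step can be sketched as a small loop that keeps adding augmented copies of minority examples until the classes are the same size. The helper balance_with_augmentation and the placeholder toy_augment are hypothetical names introduced here; the example scales the 15k versus 4k scenario down to 15 versus 4 so it runs instantly.

```python
import random

def balance_with_augmentation(majority, minority, augment, seed=0):
    """Grow the minority class with augmented copies until both classes match in size."""
    rng = random.Random(seed)
    augmented = list(minority)  # keep every original minority example
    while len(augmented) < len(majority):
        source = rng.choice(minority)      # pick an original minority example
        augmented.append(augment(source))  # add a modified copy of it
    return augmented

def toy_augment(text):
    """Hypothetical placeholder; a real text augmenter would go here."""
    return text.replace("bad", "poor") if "bad" in text else text + " overall"

positive = [f"positive review {i}" for i in range(15)]
negative = ["bad fit", "bad fabric", "too small", "arrived late"]
balanced_negative = balance_with_augmentation(positive, negative, toy_augment)
print(len(positive), len(balanced_negative))
```

In practice the augment argument would wrap a stochastic augmenter such as the aug_bert object from the usage example, so repeated draws from the same review produce different variants. Augment only the training split, so that the evaluation data still reflects the real class distribution.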

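Conclusion

As data scientists, Kaggle participants, or programmers in general, it is important that we keep finding tools that streamline our workflow. We can then use these libraries to solve problems, augment our datasets, and spend more time thinking about solutions rather than writing code.

Editor: 華軒. Source: 今日頭條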
