Benchmarking Seven Major Keyword Extraction Algorithms in Python
I have been looking for an effective algorithm for the keyword extraction task. The goal is to find one that extracts keywords efficiently, balancing extraction quality against execution time, because my data corpus is growing quickly and has already reached millions of rows. A key requirement is that the extracted keywords are always meaningful on their own, i.e. they convey something even when taken out of context.
This article tests and benchmarks several well-known keyword extraction algorithms on a corpus of 2,000 documents.
Libraries used
I used the following Python libraries for this study:
- NLTK, to help with the preprocessing stage and some helper functions
- RAKE
- YAKE
- PKE
- KeyBERT
- Spacy
- Pandas and Matplotlib, plus other general-purpose libraries
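Some of these libraries need one-time resources before anything will run. The snippet below is a small setup sketch, under the assumption that rake-nltk is the RAKE implementation in use (it relies on NLTK's stopword list and punkt tokenizer) and that spaCy uses the small English model; adjust it to your environment.

```python
import nltk

# rake-nltk needs NLTK's stopword list and sentence tokenizer
nltk.download("stopwords")
nltk.download("punkt")

# spaCy needs an English model for POS tagging; the small model is assumed here.
# It is usually installed from the command line:
#   python -m spacy download en_core_web_sm
```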
Experimental workflow
The benchmark works as follows:

First, we import the dataset that contains our text data. Then, for each algorithm, we create a separate function implementing its extraction logic:
algorithm_name(text: str) → [keyword1, keyword2, ..., keywordn]
Then we create a function that extracts keywords from the entire corpus:
extract_keywords_from_corpus(algorithm, corpus) → {algorithm, corpus_keywords, elapsed_time}
Next, with the help of spaCy, we define a matcher object that decides whether a keyword is meaningful for our task; it returns True or False.
Finally, we wrap everything into a function that outputs the final report.
Dataset
I am using a small dataset of texts collected from the internet. Here is a sample:
```python
['To follow up from my previous questions. . Here is the result!\n',
 'European mead competitions?\nI’d love some feedback on my mead, but entering the Mazer Cup isn’t an option for me, since shipping alcohol to the USA from Europe is illegal. (I know I probably wouldn’t get caught/prosecuted, but any kind of official record of an issue could screw up my upcoming citizenship application and I’m not willing to risk that).\n\nAre there any European mead comps out there? Or at least large beer comps that accept entries in the mead categories and are likely to have experienced mead judges?',
 'Orange Rosemary Booch\n',
 'Well folks, finally happened. Went on vacation and came home to mold.\n',
 'I’m opening a gelato shop in London on Friday so we’ve been up non-stop practicing flavors - here’s one of our most recent attempts!\n',
 "Does anyone have resources for creating shelf stable hot sauce? Ferment and then water or pressure can?\nI have dozens of fresh peppers I want to use to make hot sauce, but the eventual goal is to customize a recipe and send it to my buddies across the States. I believe canning would be the best way to do this, but I'm not finding a lot of details on it. Any advice?",
 'what is the practical difference between a wine filter and a water filter?\nwondering if you could use either',
 'What is the best custard base?\nDoes someone have a recipe that tastes similar to Culver’s frozen custard?',
 'Mold?\n']
```
Most of the texts are food-related. We will use a sample of 2,000 documents to test our algorithms.
We have not preprocessed the texts yet, because some of the algorithms rely on stopwords and punctuation to produce their results.
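The loading step is not shown in the snippets below; as a minimal sketch, assuming the documents sit one per row in a CSV file with a text column (the file name and column name here are hypothetical), the texts list used later could be built like this:

```python
import pandas as pd

# hypothetical file and column names; adapt them to wherever your documents live
df = pd.read_csv("food_posts.csv")
texts = df["text"].astype(str).tolist()

print(f"{len(texts)} documents loaded")
print(texts[0][:100])  # peek at the beginning of the first document
```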
Algorithms
Let's define the keyword extraction functions.
```python
# imports for the extractor functions
from rake_nltk import Rake
import yake
import pke
from keybert import KeyBERT

# initiate BERT outside of the functions
bert = KeyBERT()

# 1. RAKE
def rake_extractor(text):
    """
    Uses Rake to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    r = Rake()
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:5]

# 2. YAKE
def yake_extractor(text):
    """
    Uses YAKE to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = yake.KeywordExtractor(lan="en", n=3, windowsSize=3, top=5).extract_keywords(text)
    results = []
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 3. PositionRank
def position_rank_extractor(text):
    """
    Uses PositionRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    # define the valid parts of speech to occur in the graph
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.PositionRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos, maximum_word_number=5)
    # weight the candidates using the sum of their word scores, computed with a
    # random walk biased by the position of the words in the document. In the
    # graph, nodes are words that are connected if they occur in a window of 3 words.
    extractor.candidate_weighting(window=3, pos=pos)
    # get the 5 highest-scored candidates as keyphrases
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 4. SingleRank
def single_rank_extractor(text):
    """
    Uses SingleRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.SingleRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting(window=3, pos=pos)
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 5. MultipartiteRank
def multipartite_rank_extractor(text):
    """
    Uses MultipartiteRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    # build the Multipartite graph and rank candidates using a random walk;
    # alpha controls the weight adjustment mechanism, see TopicRank for
    # the threshold/method parameters.
    extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 6. TopicRank
def topic_rank_extractor(text):
    """
    Uses TopicRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 7. KeyBERT
def keybert_extractor(text):
    """
    Uses KeyBERT to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = bert.extract_keywords(text, keyphrase_ngram_range=(3, 5), stop_words="english", top_n=5)
    results = []
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results
```
Each extractor takes a text as input and returns a list of keywords. Usage is really straightforward.
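For instance, a quick sanity check on a single made-up sentence (the exact output depends on the RAKE ranking):

```python
sample = "I am opening a gelato shop in London and I need a shelf stable hot sauce recipe."
print(rake_extractor(sample))
# -> a list with at most 5 ranked phrases, e.g. ['shelf stable hot sauce recipe', 'gelato shop', ...]
```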
Note: for some reason, I could not initialize all the extractor objects outside their functions. Whenever I did, TopicRank and MultipartiteRank threw errors. This is not ideal in terms of performance, but the benchmark can still be completed.

We have already restricted the acceptable grammatical patterns by passing pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}: together with spaCy, this will ensure that almost all keywords make sense from a human-language point of view. We also want keywords to contain at least three words, simply to get more specific keyphrases and avoid ones that are too generic.
Extracting keywords from the entire corpus
Now let's define a function that applies a single extractor to the whole corpus while logging some information along the way.
```python
import logging
import time
from tqdm import tqdm

def extract_keywords_from_corpus(extractor, corpus):
    """This function uses an extractor to retrieve keywords from a list of documents"""
    extractor_name = extractor.__name__.replace("_extractor", "")
    logging.info(f"Starting keyword extraction with {extractor_name}")
    corpus_kws = {}
    start = time.time()
    # logging.info("Timer initiated.")  # <-- uncomment this if you want to log the start of the timer
    for idx, text in tqdm(enumerate(corpus), desc="Extracting keywords from corpus..."):
        corpus_kws[idx] = extractor(text)
    end = time.time()
    # logging.info("Timer stopped.")  # <-- uncomment this if you want to log the end of the timer
    elapsed = time.strftime("%H:%M:%S", time.gmtime(end - start))
    logging.info(f"Time elapsed: {elapsed}")
    return {"algorithm": extractor.__name__,
            "corpus_kws": corpus_kws,
            "elapsed_time": elapsed}
```
All this function does is combine the extractor's output with a handful of useful pieces of information (such as how long the task took) into a dictionary, which makes it easy to build the final report later.
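To make the returned structure concrete, here is an illustrative call on a small slice of the corpus (the keyword values in the comment are made up):

```python
rake_results = extract_keywords_from_corpus(rake_extractor, texts[:10])

# rake_results is a dictionary shaped roughly like:
# {
#     "algorithm": "rake_extractor",
#     "corpus_kws": {0: ['first keyword', 'second keyword', ...], 1: [...], ...},
#     "elapsed_time": "00:00:01",   # HH:MM:SS string
# }
```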
Grammar matching function
This function makes sure the keywords returned by the extractors are always (well, almost always) meaningful. For example:

We can clearly see that the first three keywords can stand on their own: they are perfectly meaningful and we don't need any extra information to understand them. The fourth one, however, means nothing by itself, so we want to avoid cases like it as much as possible.
spaCy and its Matcher object help us do exactly that. We will define a match function that takes a keyword and returns True if one of the defined patterns matches, and False otherwise.
```python
import spacy
from spacy.matcher import Matcher

# spaCy pipeline used for POS tagging (the small English model is assumed here)
nlp = spacy.load("en_core_web_sm")

def match(keyword):
    """This function checks if a keyword matches one of the accepted POS patterns"""
    patterns = [
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'VERB'}],
        [{'POS': 'NOUN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'ADV'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADP'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}, {'POS': 'PROPN'}],
        [{'POS': 'VERB'}, {'POS': 'ADV'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}],
    ]
    matcher = Matcher(nlp.vocab)
    matcher.add("pos-matcher", patterns)
    # create the spacy doc object
    doc = nlp(keyword)
    # run the matcher on the doc
    matches = matcher(doc)
    # if matches is not empty, at least one pattern was found
    if len(matches) > 0:
        return True
    return False
```
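A quick illustration of how the matcher behaves on a few made-up keywords (the exact outcome depends on the POS tags the spaCy model assigns):

```python
for kw in ["mead competition", "shelf stable hot sauce", "of the"]:
    print(kw, "->", match(kw))
# phrases built from nouns and adjectives tend to hit one of the patterns above,
# while fragments such as "of the" match nothing and return False
```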
Benchmark function
We are almost done. This is the last step before launching the script and collecting the results.
We will define a benchmark function that takes our corpus and a boolean flag indicating whether to shuffle the data. For each extractor, it calls extract_keywords_from_corpus, which returns a dictionary with that extractor's results. We store each of these values in a list.
For each algorithm in the list, we then compute:
- the average number of extracted keywords per document
- the average number of matched keywords per document
- a score defined as the average number of matches found divided by the time it took to run the extraction
We store all of this data in a Pandas DataFrame and then export it as a .csv file.
```python
import random
import numpy as np
import pandas as pd

def get_sec(time_str):
    """Get seconds from a HH:MM:SS time string."""
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)

def benchmark(corpus, shuffle=True):
    """This function runs the benchmark for the keyword extraction algorithms"""
    logging.info("Starting benchmark...\n")
    # shuffle the corpus
    if shuffle:
        random.shuffle(corpus)
    # extract keywords from the corpus
    results = []
    extractors = [
        rake_extractor,
        yake_extractor,
        topic_rank_extractor,
        position_rank_extractor,
        single_rank_extractor,
        multipartite_rank_extractor,
        keybert_extractor,
    ]
    for extractor in extractors:
        result = extract_keywords_from_corpus(extractor, corpus)
        results.append(result)
    # compute the average number of extracted keywords
    for result in results:
        len_of_kw_list = []
        for kws in result["corpus_kws"].values():
            len_of_kw_list.append(len(kws))
        result["avg_keywords_per_document"] = np.mean(len_of_kw_list)
    # match keywords
    for result in results:
        for idx, kws in result["corpus_kws"].items():
            match_results = []
            for kw in kws:
                match_results.append(match(kw))
            result["corpus_kws"][idx] = match_results
    # compute the average number of matched keywords
    for result in results:
        len_of_matching_kws_list = []
        for idx, kws in result["corpus_kws"].items():
            len_of_matching_kws_list.append(len([kw for kw in kws if kw]))
        result["avg_matched_keywords_per_document"] = np.mean(len_of_matching_kws_list)
        # compute the average percentage of matched keywords, rounded to 2 decimals
        result["avg_percentage_matched_keywords"] = round(result["avg_matched_keywords_per_document"] / result["avg_keywords_per_document"], 2)
    # create a score based on the average number of matched keywords divided by the time elapsed (in seconds)
    for result in results:
        elapsed_seconds = get_sec(result["elapsed_time"]) + 0.1
        # weigh the score by the time elapsed
        result["performance_score"] = round(result["avg_matched_keywords_per_document"] / elapsed_seconds, 2)
    # delete corpus_kws
    for result in results:
        del result["corpus_kws"]
    # create the results dataframe
    df = pd.DataFrame(results)
    df.to_csv("results.csv", index=False)
    logging.info("Benchmark finished. Results saved to results.csv")
    return df
```
Results
```python
results = benchmark(texts[:2000], shuffle=True)
```

Here is the report that comes out:

Let's visualize it:

According to the score formula we defined (avg_matched_keywords_per_document / time_elapsed_in_seconds), Rake processes the 2,000 documents in about 2 seconds; even though its accuracy is not on KeyBERT's level, the time factor makes it the winner.
If we only consider accuracy, computed as the ratio between avg_matched_keywords_per_document and avg_keywords_per_document, we get these results:
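As a small sketch, this ranking can be read straight off the DataFrame returned by benchmark, since avg_percentage_matched_keywords is exactly that ratio:

```python
# sort the algorithms by the accuracy ratio computed in benchmark()
accuracy = results[["algorithm", "avg_percentage_matched_keywords"]].sort_values(
    "avg_percentage_matched_keywords", ascending=False
)
print(accuracy)
```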

From an accuracy standpoint, Rake also performs quite well. If we ignore time, KeyBERT would certainly be the most accurate algorithm for extracting meaningful keywords; Rake comes second in accuracy, but by a fairly wide margin.
If accuracy is what you need, KeyBERT is definitely the first choice; if speed matters most, Rake is the way to go, since it is fast and its accuracy is still acceptable.