Benchmarking Seven Major Keyword Extraction Algorithms in Python
I have been looking for an effective algorithm for the keyword extraction task. The goal is to find one that extracts keywords efficiently, balancing extraction quality against execution time, because my data corpus is growing quickly and has already reached millions of rows. A key requirement is that the extracted keywords are always meaningful on their own, i.e. they convey something even when taken out of context.
This article tests and benchmarks several well-known keyword extraction algorithms on a corpus of 2,000 documents.
Libraries used
I used the following Python libraries for this study:
- NLTK, to help with the preprocessing stage and some helper functions
- RAKE
- YAKE
- PKE
- KeyBERT
- Spacy
- Pandas and Matplotlib, plus other general-purpose libraries
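Some of these libraries need one-time resources before anything will run. The snippet below is a small setup sketch, under the assumption that rake-nltk is the RAKE implementation in use (it relies on NLTK's stopword list and punkt tokenizer) and that spaCy uses the small English model; adjust it to your environment.

```python
import nltk

# rake-nltk needs NLTK's stopword list and sentence tokenizer
nltk.download("stopwords")
nltk.download("punkt")

# spaCy needs an English model for POS tagging; the small model is assumed here.
# It is usually installed from the command line:
#   python -m spacy download en_core_web_sm
```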
Experimental workflow
The benchmark works as follows:

First, we import the dataset that contains our text data. Then, for each algorithm, we create a separate function implementing its extraction logic:
algorithm_name(text: str) → [keyword1, keyword2, ..., keywordn]
Then we create a function that extracts keywords from the entire corpus:
extract_keywords_from_corpus(algorithm, corpus) → {algorithm, corpus_keywords, elapsed_time}
Next, with the help of spaCy, we define a matcher object that decides whether a keyword is meaningful for our task; it returns True or False.
Finally, we wrap everything into a function that outputs the final report.
Dataset
I am using a small dataset of texts collected from the internet. Here is a sample:
```python
['To follow up from my previous questions. . Here is the result!\n',
 'European mead competitions?\nI’d love some feedback on my mead, but entering the Mazer Cup isn’t an option for me, since shipping alcohol to the USA from Europe is illegal. (I know I probably wouldn’t get caught/prosecuted, but any kind of official record of an issue could screw up my upcoming citizenship application and I’m not willing to risk that).\n\nAre there any European mead comps out there? Or at least large beer comps that accept entries in the mead categories and are likely to have experienced mead judges?',
 'Orange Rosemary Booch\n',
 'Well folks, finally happened. Went on vacation and came home to mold.\n',
 'I’m opening a gelato shop in London on Friday so we’ve been up non-stop practicing flavors - here’s one of our most recent attempts!\n',
 "Does anyone have resources for creating shelf stable hot sauce? Ferment and then water or pressure can?\nI have dozens of fresh peppers I want to use to make hot sauce, but the eventual goal is to customize a recipe and send it to my buddies across the States. I believe canning would be the best way to do this, but I'm not finding a lot of details on it. Any advice?",
 'what is the practical difference between a wine filter and a water filter?\nwondering if you could use either',
 'What is the best custard base?\nDoes someone have a recipe that tastes similar to Culver’s frozen custard?',
 'Mold?\n']
```
Most of the texts are food-related. We will use a sample of 2,000 documents to test our algorithms.
We have not preprocessed the texts yet, because some of the algorithms rely on stopwords and punctuation to produce their results.
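The loading step is not shown in the snippets below; as a minimal sketch, assuming the documents sit one per row in a CSV file with a text column (the file name and column name here are hypothetical), the texts list used later could be built like this:

```python
import pandas as pd

# hypothetical file and column names; adapt them to wherever your documents live
df = pd.read_csv("food_posts.csv")
texts = df["text"].astype(str).tolist()

print(f"{len(texts)} documents loaded")
print(texts[0][:100])  # peek at the beginning of the first document
```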
Algorithms
Let's define the keyword extraction functions.
```python
# imports for the extractor functions
from rake_nltk import Rake
import yake
import pke
from keybert import KeyBERT

# initiate BERT outside of the functions
bert = KeyBERT()

# 1. RAKE
def rake_extractor(text):
    """
    Uses Rake to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    r = Rake()
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:5]

# 2. YAKE
def yake_extractor(text):
    """
    Uses YAKE to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = yake.KeywordExtractor(lan="en", n=3, windowsSize=3, top=5).extract_keywords(text)
    results = []
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 3. PositionRank
def position_rank_extractor(text):
    """
    Uses PositionRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    # define the valid parts of speech to occur in the graph
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.PositionRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos, maximum_word_number=5)
    # weight the candidates using the sum of their word scores, computed with a
    # random walk biased by the position of the words in the document. In the
    # graph, nodes are words that are connected if they occur in a window of 3 words.
    extractor.candidate_weighting(window=3, pos=pos)
    # get the 5 highest-scored candidates as keyphrases
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 4. SingleRank
def single_rank_extractor(text):
    """
    Uses SingleRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.SingleRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting(window=3, pos=pos)
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 5. MultipartiteRank
def multipartite_rank_extractor(text):
    """
    Uses MultipartiteRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    # build the Multipartite graph and rank candidates using a random walk;
    # alpha controls the weight adjustment mechanism, see TopicRank for
    # the threshold/method parameters.
    extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 6. TopicRank
def topic_rank_extractor(text):
    """
    Uses TopicRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 7. KeyBERT
def keybert_extractor(text):
    """
    Uses KeyBERT to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = bert.extract_keywords(text, keyphrase_ngram_range=(3, 5), stop_words="english", top_n=5)
    results = []
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results
```
Each extractor takes a text as input and returns a list of keywords. Usage is really straightforward.
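For instance, a quick sanity check on a single made-up sentence (the exact output depends on the RAKE ranking):

```python
sample = "I am opening a gelato shop in London and I need a shelf stable hot sauce recipe."
print(rake_extractor(sample))
# -> a list with at most 5 ranked phrases, e.g. ['shelf stable hot sauce recipe', 'gelato shop', ...]
```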
Note: for some reason, I could not initialize all the extractor objects outside their functions. Whenever I did, TopicRank and MultipartiteRank threw errors. This is not ideal in terms of performance, but the benchmark can still be completed.

We have already restricted the acceptable grammatical patterns by passing pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}: together with spaCy, this will ensure that almost all keywords make sense from a human-language point of view. We also want keywords to contain at least three words, simply to get more specific keyphrases and avoid ones that are too generic.
Extracting keywords from the entire corpus
Now let's define a function that applies a single extractor to the whole corpus while logging some information along the way.
```python
import logging
import time
from tqdm import tqdm

def extract_keywords_from_corpus(extractor, corpus):
    """This function uses an extractor to retrieve keywords from a list of documents"""
    extractor_name = extractor.__name__.replace("_extractor", "")
    logging.info(f"Starting keyword extraction with {extractor_name}")
    corpus_kws = {}
    start = time.time()
    # logging.info("Timer initiated.")  # <-- uncomment this if you want to log the start of the timer
    for idx, text in tqdm(enumerate(corpus), desc="Extracting keywords from corpus..."):
        corpus_kws[idx] = extractor(text)
    end = time.time()
    # logging.info("Timer stopped.")  # <-- uncomment this if you want to log the end of the timer
    elapsed = time.strftime("%H:%M:%S", time.gmtime(end - start))
    logging.info(f"Time elapsed: {elapsed}")
    return {"algorithm": extractor.__name__,
            "corpus_kws": corpus_kws,
            "elapsed_time": elapsed}
```
All this function does is combine the extractor's output with a handful of useful pieces of information (such as how long the task took) into a dictionary, which makes it easy to build the final report later.
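To make the returned structure concrete, here is an illustrative call on a small slice of the corpus (the keyword values in the comment are made up):

```python
rake_results = extract_keywords_from_corpus(rake_extractor, texts[:10])

# rake_results is a dictionary shaped roughly like:
# {
#     "algorithm": "rake_extractor",
#     "corpus_kws": {0: ['first keyword', 'second keyword', ...], 1: [...], ...},
#     "elapsed_time": "00:00:01",   # HH:MM:SS string
# }
```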
Grammar matching function
This function makes sure the keywords returned by the extractors are always (well, almost always) meaningful. For example:

We can clearly see that the first three keywords can stand on their own: they are perfectly meaningful and we don't need any extra information to understand them. The fourth one, however, means nothing by itself, so we want to avoid cases like it as much as possible.
spaCy and its Matcher object help us do exactly that. We will define a match function that takes a keyword and returns True if one of the defined patterns matches, and False otherwise.
```python
import spacy
from spacy.matcher import Matcher

# spaCy pipeline used for POS tagging (the small English model is assumed here)
nlp = spacy.load("en_core_web_sm")

def match(keyword):
    """This function checks if a keyword matches one of the accepted POS patterns"""
    patterns = [
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'VERB'}],
        [{'POS': 'NOUN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'ADV'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADP'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}, {'POS': 'PROPN'}],
        [{'POS': 'VERB'}, {'POS': 'ADV'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}],
    ]
    matcher = Matcher(nlp.vocab)
    matcher.add("pos-matcher", patterns)
    # create the spacy doc object
    doc = nlp(keyword)
    # run the matcher on the doc
    matches = matcher(doc)
    # if matches is not empty, at least one pattern was found
    if len(matches) > 0:
        return True
    return False
```
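A quick illustration of how the matcher behaves on a few made-up keywords (the exact outcome depends on the POS tags the spaCy model assigns):

```python
for kw in ["mead competition", "shelf stable hot sauce", "of the"]:
    print(kw, "->", match(kw))
# phrases built from nouns and adjectives tend to hit one of the patterns above,
# while fragments such as "of the" match nothing and return False
```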
Benchmark function
We are almost done. This is the last step before launching the script and collecting the results.
We will define a benchmark function that takes our corpus and a boolean flag indicating whether to shuffle the data. For each extractor, it calls extract_keywords_from_corpus, which returns a dictionary with that extractor's results. We store each of these values in a list.
For each algorithm in the list, we then compute:
- the average number of extracted keywords per document
- the average number of matched keywords per document
- a score defined as the average number of matches found divided by the time it took to run the extraction
We store all of this data in a Pandas DataFrame and then export it as a .csv file.
```python
import random
import numpy as np
import pandas as pd

def get_sec(time_str):
    """Get seconds from a HH:MM:SS time string."""
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)

def benchmark(corpus, shuffle=True):
    """This function runs the benchmark for the keyword extraction algorithms"""
    logging.info("Starting benchmark...\n")
    # shuffle the corpus
    if shuffle:
        random.shuffle(corpus)
    # extract keywords from the corpus
    results = []
    extractors = [
        rake_extractor,
        yake_extractor,
        topic_rank_extractor,
        position_rank_extractor,
        single_rank_extractor,
        multipartite_rank_extractor,
        keybert_extractor,
    ]
    for extractor in extractors:
        result = extract_keywords_from_corpus(extractor, corpus)
        results.append(result)
    # compute the average number of extracted keywords
    for result in results:
        len_of_kw_list = []
        for kws in result["corpus_kws"].values():
            len_of_kw_list.append(len(kws))
        result["avg_keywords_per_document"] = np.mean(len_of_kw_list)
    # match keywords
    for result in results:
        for idx, kws in result["corpus_kws"].items():
            match_results = []
            for kw in kws:
                match_results.append(match(kw))
            result["corpus_kws"][idx] = match_results
    # compute the average number of matched keywords
    for result in results:
        len_of_matching_kws_list = []
        for idx, kws in result["corpus_kws"].items():
            len_of_matching_kws_list.append(len([kw for kw in kws if kw]))
        result["avg_matched_keywords_per_document"] = np.mean(len_of_matching_kws_list)
        # compute the average percentage of matched keywords, rounded to 2 decimals
        result["avg_percentage_matched_keywords"] = round(result["avg_matched_keywords_per_document"] / result["avg_keywords_per_document"], 2)
    # create a score based on the average number of matched keywords divided by the time elapsed (in seconds)
    for result in results:
        elapsed_seconds = get_sec(result["elapsed_time"]) + 0.1
        # weigh the score by the time elapsed
        result["performance_score"] = round(result["avg_matched_keywords_per_document"] / elapsed_seconds, 2)
    # delete corpus_kws
    for result in results:
        del result["corpus_kws"]
    # create the results dataframe
    df = pd.DataFrame(results)
    df.to_csv("results.csv", index=False)
    logging.info("Benchmark finished. Results saved to results.csv")
    return df
```
Results
```python
results = benchmark(texts[:2000], shuffle=True)
```

Here is the report that comes out:

Let's visualize it:

According to the score formula we defined (avg_matched_keywords_per_document / time_elapsed_in_seconds), Rake processes the 2,000 documents in about 2 seconds; even though its accuracy is not on KeyBERT's level, the time factor makes it the winner.
If we only consider accuracy, computed as the ratio between avg_matched_keywords_per_document and avg_keywords_per_document, we get these results:
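As a small sketch, this ranking can be read straight off the DataFrame returned by benchmark, since avg_percentage_matched_keywords is exactly that ratio:

```python
# sort the algorithms by the accuracy ratio computed in benchmark()
accuracy = results[["algorithm", "avg_percentage_matched_keywords"]].sort_values(
    "avg_percentage_matched_keywords", ascending=False
)
print(accuracy)
```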

From an accuracy standpoint, Rake also performs quite well. If we ignore time, KeyBERT would certainly be the most accurate algorithm for extracting meaningful keywords; Rake comes second in accuracy, but by a fairly wide margin.
If accuracy is what you need, KeyBERT is definitely the first choice; if speed matters most, Rake is the way to go, since it is fast and its accuracy is still acceptable.