
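Hello, what has everyone been eating lately? This is Pudding, who has been thinking about how traditional summarization and automatic summarization are alike and how they differ. This post shares a short 10-minute slide deck I prepared earlier to introduce TextRank, an automatic text summarization algorithm, together with a Python script you can use to try it out yourself.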

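About TextRank

(Image source: Mihalcea & Tarau, 2004)

TextRank was introduced in the paper "Textrank: Bringing order into text", which Mihalcea and Tarau presented at the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP). Earlier natural language processing methods mostly focused on the syntactic structure of language, whereas TextRank borrows from PageRank and analyzes text from the perspective of social network analysis.

TextRank can automatically pick out important keywords or sentences, and several selected sentences together can serve as a summary. Among automatic text summarization techniques, TextRank is an extractive approach: it only selects important sentences and does not rewrite the original content. It is also an unsupervised method, so no training data has to be prepared before building the model.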
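To make the idea of ranking sentences on a graph more concrete, here is a minimal sketch in plain Python. It is not the code used later in this post. It assumes a naive regex sentence splitter and word tokenizer (both are simplifications of mine), uses the word-overlap similarity normalized by the log of the sentence lengths as described by Mihalcea & Tarau (2004), and scores sentences with a PageRank-style iteration with a damping factor of 0.85. All function names are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared-word count divided by log(len(a)) + log(len(b))."""
    words_a = re.findall(r"\w+", a.lower())
    words_b = re.findall(r"\w+", b.lower())
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0:  # both sentences contain a single word
        return 0.0
    return len(set(words_a) & set(words_b)) / denominator


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score each sentence with a weighted PageRank-style power iteration."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, sentence_count=3):
    """Return the highest-scoring sentences, best first."""
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:sentence_count]]
```

Calling `summarize(text, 3)` returns three sentences copied verbatim from `text`, which is exactly what "extractive" means here: nothing is rewritten.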

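The strengths of TextRank are that it runs fast, needs no prior training or model building, and is especially well suited to text whose form varies widely and in which new vocabulary keeps appearing. On the other hand, its weakness is that what it extracts is not a "summary" in the everyday sense but simply sentences taken from the text. The sentences and keywords it selects also tend to be too similar to one another: if a text covers several different topics, the extracted sentences and keywords will in most cases reflect only one of them.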

Citation

Mihalcea, R., & Tarau, P. (2004). Textrank: Bringing order into text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404–411.

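Abstract

In this paper, we introduce TextRank, a graph-based ranking model for text processing, and show how it can be successfully used in natural language applications. In particular, we propose two innovative unsupervised methods for keyword and sentence extraction, and show that the results obtained compare favorably with previously published results.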

Slides

Backups converted to PowerPoint format: GitHub, Google Drive NCCU, One Drive, Mega, Box, MediaFire, SlideShare

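Practice

I based this exercise on the pytextrank code and reorganized it into a form that can be used directly in Colab. That way you can run the analysis right in Colab without preparing a Python environment, and you can adjust the output by changing just a few parameters. Let's walk through how to do it.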
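For readers who want to see what calling pytextrank looks like, here is a minimal sketch. It is not the textrank-practice.py script itself. It assumes pytextrank 3.x with spaCy and the `en_core_web_sm` model already installed; the function name and its parameters are hypothetical. The imports sit inside the function so that the definition loads even when those packages are missing.

```python
def extract_with_pytextrank(text, phrase_count=10, sentence_count=3):
    """Return (phrases, sentences) chosen by pytextrank's TextRank pipeline.

    Assumes `pip install spacy pytextrank` and
    `python -m spacy download en_core_web_sm` have already been run.
    """
    import spacy       # imported lazily: optional, heavy dependency
    import pytextrank  # noqa: F401  importing it registers the "textrank" pipe

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")
    doc = nlp(text)

    # doc._.phrases is a list of ranked phrases, best first.
    phrases = [phrase.text for phrase in doc._.phrases[:phrase_count]]

    # summary() yields the sentences judged most representative.
    sentences = [
        span.text.strip()
        for span in doc._.textrank.summary(
            limit_phrases=phrase_count, limit_sentences=sentence_count
        )
    ]
    return phrases, sentences
```

A call such as `extract_with_pytextrank(open("input.txt", encoding="utf-8").read())` would return the two lists in one go.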

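Step 1. Open a new notebook in Colab

Open a new notebook in Colab

Click the link above to open a new notebook in Colab.

Colab (full name "Colaboratory") is a tool developed by Google that lets you write and run Python code directly in the browser. You need to sign in with a Google account to use it.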

Step 2. Paste the textrank-practice.py code

textrank-practice.py

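Open the page for the textrank-practice.py script from the link above and click the "Copy raw contents" button to copy the whole script. The button is on the right side of the toolbar above the file contents, next to "Raw" and "Blame".

Then go back to Colab and press Ctrl + V in the code cell to paste the script. The screenshot above shows the cell after pasting.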

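Step 3. Prepare input.txt

Next, prepare the text file from which keywords and a summary will be extracted. The script currently handles English only. Save the file with UTF-8 encoding and name it "input.txt".

input.txt sample file

You can download the sample input.txt directly from the link above. Its contents are as follows: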

Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.
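If you would rather create input.txt from code than from a text editor, a small sketch like the following makes the UTF-8 requirement explicit. The helper names are hypothetical and are not part of the script.

```python
from pathlib import Path


def save_as_utf8(text, path="input.txt"):
    """Write text to path using UTF-8, whatever the platform default is."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target


def read_utf8(path="input.txt"):
    """Read the file back with the same explicit encoding."""
    return Path(path).read_text(encoding="utf-8")
```

Passing `encoding="utf-8"` on both the write and the read avoids surprises on systems whose default encoding is something else.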

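Step 4. Upload input.txt

Next, upload input.txt to Colab. Back in Colab, click the folder icon (Files) on the left to open the file panel.

The file panel takes a moment to open. Once it is ready, you will find the "Upload to session storage" button in the toolbar under the "Files" heading.

Select the input.txt file to upload.

The first time you upload, you will see a notice. Scripts running in Colab are only held temporarily in the cloud and are normally kept for just 12 hours. If your script needs to run for more than 12 hours, it is better to run the Python script on your local machine.

After the upload, input.txt appears in the file panel. You can view, edit, and download these files directly in Colab.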
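As an alternative to the file panel, the upload can also be triggered from code. The sketch below is an assumption of mine rather than part of the script: it first checks whether the file is already there and only falls back to Colab's `google.colab.files.upload()` dialog when running inside Colab. The function name and the `allow_upload` flag are hypothetical.

```python
from pathlib import Path


def find_input_file(path="input.txt", allow_upload=True):
    """Return the file's path; inside Colab, offer an upload dialog if it is missing."""
    target = Path(path)
    if target.is_file():
        return target
    if allow_upload:
        try:
            from google.colab import files  # only importable inside Colab
        except ImportError:
            files = None
        if files is not None:
            uploaded = files.upload()  # dict mapping file name to its bytes
            if target.name in uploaded:
                target.write_bytes(uploaded[target.name])
                return target
    raise FileNotFoundError(f"{path} not found; please upload it first")
```

Outside Colab the function simply raises `FileNotFoundError`, which is a clearer failure than an error deep inside the analysis.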

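Step 5. Edit the configuration in the script

Before running the Python script, let's look at the configuration parameters at the top of the script.

I have pulled out the two most commonly used parameters. They sit at the very beginning of the script: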

# Configuration
output_phrases_count = 10
output_sentences_count = 3

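output_phrases_count: controls how many keywords are output. The larger the number, the more keywords are output.
output_sentences_count: controls how many summary sentences are output. The larger the number, the more sentences TextRank selects.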
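One plausible way such limits take effect is by cutting off ranked lists, sketched below. This is an illustration under my own assumptions, not the script's actual logic; `apply_limits` is a hypothetical helper, and both input lists are assumed to be sorted best first already.

```python
# Configuration (same names and defaults as shown above)
output_phrases_count = 10
output_sentences_count = 3


def apply_limits(ranked_phrases, ranked_sentences,
                 phrases_count=output_phrases_count,
                 sentences_count=output_sentences_count):
    """Keep only the top entries of two lists that are already ranked."""
    return ranked_phrases[:phrases_count], ranked_sentences[:sentences_count]
```

If fewer items are available than requested, slicing simply returns everything there is, so a large value never causes an error.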

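Step 6. Run the script

Once input.txt is ready and the parameters are set, press the run button at the top left of the code cell.

The first run takes a long time because the Python script installs the required packages inside Colab. Installation only has to happen once. On the second run the script skips it and finishes much faster.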
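The "install once, skip afterwards" behavior can be sketched as follows. This is a generic pattern under my own assumptions, not a copy of the script: `is_installed` and `ensure_package` are hypothetical helpers that check whether a module can be imported before asking pip to install it.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """True when the module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


def ensure_package(module_name, pip_name=None):
    """Install the package with pip only if it is missing.

    Returns True when an installation was performed, False when it was skipped.
    """
    if is_installed(module_name):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or module_name]
    )
    return True
```

For example, `ensure_package("pytextrank")` would install on the first run of a fresh Colab session and return `False` immediately on later runs.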

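When a green check mark appears to the left of the run button, the run has finished.

After a short wait, the two result files appear in the file panel: output-phrases.txt and output-sentences.txt.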
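Writing the two result files can be sketched like this. The file names come from this post, while the one-item-per-line layout is inferred from the sample outputs shown in Step 7. The helper names are hypothetical.

```python
from pathlib import Path


def write_lines(lines, path):
    """Write one item per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def save_results(phrases, sentences, folder="."):
    """Save keywords and summary sentences next to each other in folder."""
    folder = Path(folder)
    write_lines(phrases, folder / "output-phrases.txt")
    write_lines(sentences, folder / "output-sentences.txt")
```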

Step 7. Get outputs

Double-click a result file to view its contents directly in Colab. You can also download the file from the menu to its right.

Phrases

The keywords TextRank selected are saved in output-phrases.txt. The file contains:

mixed types
minimal generating sets
systems
nonstrict inequations
strict inequations
natural numbers
linear Diophantine equations
solutions
linear constraints
a minimal supporting set

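Notice that the keywords TextRank extracts are not necessarily single words. They can be phrases made up of several words. This is because TextRank looks not only at how often words occur but also at the order in which they appear. When important keywords sit next to each other, TextRank outputs them together as one phrase.

Because the parameter was set to "output_phrases_count = 10" earlier, only the top 10 keywords are output here. If you need more keywords, adjust that parameter.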
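The way adjacent keywords turn into phrases can be sketched in plain Python. This follows the keyword procedure described by Mihalcea & Tarau (2004) in simplified form: words that co-occur within a small window are linked in a graph, the words are scored with PageRank, and top-scoring words that are adjacent in the text are joined. The tiny stopword list stands in for the paper's part-of-speech filter and is my own assumption, as are all function names. pytextrank itself works differently in its details.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (assumption); a real filter would be larger.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "of", "the", "to"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def rank_words(tokens, window=2, damping=0.85, iterations=50):
    """Score words by PageRank over an undirected co-occurrence graph."""
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        if word in STOPWORDS:
            continue
        # Link the word to the content words that follow it inside the window.
        for other in tokens[i + 1:i + window]:
            if other not in STOPWORDS and other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[word]
            )
            for word in neighbours
        }
    return scores


def merge_adjacent(tokens, keywords):
    """Join runs of consecutive keywords in the token stream into phrases."""
    phrases, current = [], []
    for word in tokens + [None]:  # the None sentinel flushes the last run
        if word in keywords:
            current.append(word)
        elif current:
            phrase = " ".join(current)
            if phrase not in phrases:
                phrases.append(phrase)
            current = []
    return phrases


def extract_phrases(text, keyword_count=5):
    tokens = tokenize(text)
    scores = rank_words(tokens)
    top = sorted(scores, key=scores.get, reverse=True)[:keyword_count]
    return merge_adjacent(tokens, set(top))
```

With `window=2` only directly neighboring words are linked. A larger window links words that are further apart and usually changes which words score highest.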

Sentences

Now let's look at the important sentences TextRank selected. output-sentences.txt contains:

These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.
Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given.
Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.

Each line is a separate sentence, three in total. If you want TextRank to find more sentences, change the "output_sentences_count = 3" setting described earlier.
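In the sample output the last sentence of input.txt is listed first, so the sentences do not follow the order of the original text. They appear to be ordered by score, although that is my reading of the output rather than something the script states. If you prefer a summary that reads in document order, a small post-processing step like this hypothetical helper can restore it.

```python
def in_document_order(selected, all_sentences):
    """Sort the selected sentences by where they appear in the original text."""
    position = {sentence: i for i, sentence in enumerate(all_sentences)}
    # Sentences that cannot be found are pushed to the end.
    return sorted(selected, key=lambda s: position.get(s, len(all_sentences)))
```

Here `all_sentences` is the full list of sentences from input.txt in their original order, and `selected` is the list read from output-sentences.txt.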

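In closing

(Image source: Mihalcea & Tarau, 2004)

TextRank is a classic method in automatic text summarization. It is strikingly creative, yet its principle is not hard to understand. Inspired by TextRank, later studies have kept proposing variants, including a TextRank improvement from 2019 and BART-TextRank from 2021. In unsupervised automatic text summarization, where no training data has to be prepared in advance, TextRank is an algorithm well worth getting to know.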

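How to process Chinese text?

The TextRank script that accompanies this post can only handle English text. To process Chinese, you need to run word segmentation first. I once wrote about doing word segmentation with the Jieba package in Python 2, but Python 2 no longer fits today's Python 3 environments. For text that is mainly from Taiwan, I recommend CkipTagger, released by the Institute of Linguistics at Academia Sinica. For how to use CkipTagger, see Clay's article "[NLP][Python] 透過 ckiptagger 來使用繁體中文斷詞的最佳工具 CKIP".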
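As a rough idea of what the segmentation step looks like, here is a minimal sketch that wraps CkipTagger's word segmenter. It assumes the ckiptagger package and its TensorFlow dependency are installed and that the model files have already been downloaded into `model_dir` (for example with `ckiptagger.data_utils.download_data_gdrive("./")`). The function name is hypothetical, and connecting the result to the English-only script in this post would take further changes that are not shown here.

```python
def segment_with_ckiptagger(sentences, model_dir="./data"):
    """Split each Chinese sentence into a list of words using CkipTagger.

    `sentences` is a list of strings; the result is a list of word lists,
    one per input sentence.
    """
    from ckiptagger import WS  # imported lazily: optional, heavy dependency

    ws = WS(model_dir)   # loads the word segmentation model from disk
    return ws(sentences)
```

The returned word lists play the same role for Chinese that whitespace-separated tokens play for English when building a TextRank graph.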

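That wraps up this introduction to TextRank. To finish, here is a question about applying automatic text summarization:

What kind of text would you want to have summarized?

A. News: There is far too much news every day to read it all! I need a TextRank helper to summarize it for me!
B. Online forums: The Gossiping and HatePolitics boards are too messy. Can you tidy them up for me?
C. Papers: My professor told me to review papers, but there are far too many to get through. Save me, Dora-TextRank!
D. Other: I have another use case in mind.

Feel free to share your thoughts in the comments below. Your feedback is what keeps me sharing!

If you found this post useful, please give it a like with the AddThis sharing tool, or share it on Facebook and other social media! If you would like to follow me on social media, come and like my Facebook fan page "布丁布丁吃什麼？".

I wonder whether this post gave you something interesting to chew on. This is Pudding, see you next time!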

Introduction to TextRank

Posted April 9, 2022 under Data Mining, Machine Learning, Presentation, Programming/Python
