This is a short note on why the wordcloud package in R could not draw a Chinese word cloud, and how I fixed it.

The word cloud is all □ boxes

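Following the tutorials Basic Text Mining in R, 陳嘉葳's 用R進行中文 text Mining, and Byran's [R] TEXT MINING(文字探勘、文本分析練習), I finally got far enough to draw a word cloud with the wordcloud package.

The word segmentation and term-frequency results looked like this: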

Terms    ppt2.txt ppt.txt
  公主          1       0
  原因          1       0
  同學          0       1
  名字          1       0
  地板          1       0
  小女          1       0
  手機          1       0
  水質          1       0
  畢業生        0       1
  白痴          1       0

But the output of wordcloud looked like the figure below:

[Figure: word cloud in which every Chinese term is drawn as □]

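Why did the Chinese turn into □? Is this yet another encoding problem?

No. For anyone with some Linux common sense, the first thing □ should bring to mind is that the system has no Chinese fonts.

The cause was that I was running the English version of Ubuntu 14.04, a fresh and bare installation, rather than the Windows that most tutorials use. A default English Ubuntu ships without Chinese fonts, so naturally wordcloud cannot draw a word cloud that contains Chinese characters.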
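Before installing anything, it may help to confirm that missing fonts really are the cause. The following is a minimal sketch, assuming fontconfig's `fc-list` tool is available on the system; the function name `has_cjk_font` is made up for illustration and is not part of the original post.

```shell
#!/bin/sh
# has_cjk_font (hypothetical helper): returns 0 if fontconfig knows at least
# one font that declares Chinese coverage, 1 if it knows none, and 2 if
# fc-list itself is not installed.
has_cjk_font() {
    command -v fc-list >/dev/null 2>&1 || return 2
    # ":lang=zh" restricts the listing to fonts covering Chinese;
    # "family" prints only the family names.
    [ -n "$(fc-list :lang=zh family 2>/dev/null)" ]
}

if has_cjk_font; then
    echo "Chinese fonts found"
else
    echo "No Chinese fonts found (or fc-list is unavailable)"
fi
```

If this reports no Chinese fonts, the □ boxes are most likely a font problem rather than an encoding problem.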

Installing Chinese fonts in Ubuntu

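To install Chinese fonts, it makes sense to pick from the open-source ones. I followed the post Ubuntu 安裝思源字體 and installed the Noto CJK (思源) fonts released by Google.

I don't plan to go into detail in this post, so here is just an outline of what I did: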

Download the fonts from Google Noto Fonts by choosing DOWNLOAD ALL FONTS:
https://www.google.com/get/noto/
Create a .fonts/noto folder under the current user's home directory:
```
mkdir -p ~/.fonts/noto
```
From the downloaded Noto archive Noto-hinted.zip, extract the files with the .otf extension and upload them to ~/.fonts/noto

Create the file ~/.fonts.conf with the following content:

<fontconfig>
  <match target="pattern">
    <test qual="any" name="family">
      <string>sans-serif</string>
    </test>
    <edit name="family" mode="prepend" binding="strong">
      <string>Noto Sans T Chinese</string>
      <string>Noto Sans S Chinese</string>
      <string>Noto Sans Japanese</string>
      <string>Noto Sans Korean</string>
    </edit>
  </match>
</fontconfig>

Reboot
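The folder and extraction steps above can be sketched as a small shell script. This is only a sketch under assumptions: the location `~/Downloads/Noto-hinted.zip` is a guess at where the archive was saved, and the `FONT_DIR` and `ZIP_FILE` variables are introduced here for illustration. It also runs `fc-cache` to refresh the font cache, which is not a step from the original outline and is not claimed to replace the reboot.

```shell
#!/bin/sh
# Assumed locations; override them from the environment if yours differ.
FONT_DIR="${FONT_DIR:-$HOME/.fonts/noto}"
ZIP_FILE="${ZIP_FILE:-$HOME/Downloads/Noto-hinted.zip}"

# Create the per-user font folder (no error if it already exists).
mkdir -p "$FONT_DIR"

if [ -f "$ZIP_FILE" ] && command -v unzip >/dev/null 2>&1; then
    # -j drops the directory structure inside the archive,
    # -o overwrites existing files without prompting,
    # '*.otf' extracts only the OpenType font files.
    unzip -j -o "$ZIP_FILE" '*.otf' -d "$FONT_DIR"
else
    echo "Skipping extraction: $ZIP_FILE or unzip not found" >&2
fi

# Rebuild the fontconfig cache for the new folder, if fontconfig is installed.
if command -v fc-cache >/dev/null 2>&1; then
    fc-cache -f "$FONT_DIR"
fi
```

The ~/.fonts.conf file still has to be created by hand as shown above; the script only covers the folder and extraction steps.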

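Creating the word cloud again

Back in R, I ran the wordcloud() call that generates the word cloud once more, and after a short wait it drew correctly, as in the figure above (the conclusion being that the wife is the boss?). Details such as adjusting the spacing between words and choosing colors are left for later study.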

Why is R's word cloud all □? The wordcloud package needs Chinese fonts

November 6, 2016 | Programming/R, Software/R
