Comments on 布丁布丁吃什麼?: 簡易PHP中文斷字器 / A Simple Chinese Word Tokenizer in PHP
Blog author: 布丁布丁吃布丁 (http://www.blogger.com/profile/13614721642960940190)
2 comments, feed updated 2024-03-29T10:21:47.284+08:00

2017-11-15T23:14:38.938+08:00
By the way, this is also how to implement single-character words ("單字詞", "一字詞", unigrams).
Anyone who needs it is welcome to use this as a reference.

It is definitely not as simple as taking the i-th character of a string.
Posted by 布丁布丁吃布丁 (https://www.blogger.com/profile/18000418899714977849)

2017-04-03T00:15:31.920+08:00
I wrote a JavaScript version of the tokenizer and am recording it here:
https://github.com/pulipulichen/jieba-js/blob/54350cae3ea95e18c326c6443a9237afcd979fcd/weka/spreadsheet2arff/script.js#L224

var _add_chinese_space = function (_content) {
    // Arrays are processed element by element.
    if (Array.isArray(_content)) {
        var _new_content = [];
        for (var _i = 0; _i < _content.length; _i++) {
            _new_content.push(_add_chinese_space(_content[_i]));
        }
        return _new_content;
    }

    var _result = _content;

    // Put spaces around every underscore and every non-word character,
    // which includes each CJK character. The "u" flag matches whole code
    // points, so characters outside the BMP are not split in half.
    _result = _result.replace(/[_\W]/gu, function (_match) {
        if (/\s/.test(_match)) {
            return _match; // leave existing whitespace alone
        }
        return " " + _match + " ";
    });
    // Replace control characters, which Solr rejects as illegal characters.
    _result = _result.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, ' ');
    // Collapse runs of whitespace and trim both ends.
    _result = _result.replace(/\s+/g, ' ');
    _result = _result.trim();
    return _result;
};
Posted by 布丁布丁吃布丁 (https://www.blogger.com/profile/18000418899714977849)
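
To illustrate the remark that unigrams are "not as simple as taking the i-th character", here is a minimal sketch that turns the space-insertion idea into a token list and compares it with naive indexing. The function names `addChineseSpace`, `toUnigrams`, and `naiveChars` are made up for this example and do not appear in the original post or repository. The sketch also assumes a JavaScript runtime that supports the `u` regex flag.

```javascript
// Hypothetical helper: put spaces around every underscore and non-word
// character (each CJK character is a non-word character for \W).
// The "u" flag makes the regex work on whole code points.
function addChineseSpace(text) {
  return text
    .replace(/[_\W]/gu, (ch) => (/\s/.test(ch) ? ch : " " + ch + " "))
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical helper: split the spaced string into tokens. Runs of ASCII
// letters and digits stay together; every other character is its own token.
function toUnigrams(text) {
  const spaced = addChineseSpace(text);
  return spaced === "" ? [] : spaced.split(" ");
}

// Naive approach for comparison: one UTF-16 code unit per element.
function naiveChars(text) {
  return text.split("");
}

console.log(toUnigrams("PHP中文斷字器 v2"));
// ["PHP", "中", "文", "斷", "字", "器", "v2"]

console.log(toUnigrams("𠮷野家").length); // 3 tokens
console.log(naiveChars("𠮷野家").length); // 4 code units: "𠮷" is a surrogate pair
```

Naive indexing breaks Latin words such as "PHP" into single letters and cuts characters outside the Basic Multilingual Plane in half. The spacing approach keeps both intact.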