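How can I parse text such as
"Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. --Mr.Wikipedia"
into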
words = ["Text", "segmentation", "is", "the", "process", "of", "dividing", "written", "text", "into", "meaningful", "units", "such", "as", "words", "sentences", "or", "topics", "Mr.Wikipedia"]
I have already tried a replace-based approach, but I end up with several consecutive spaces.
Answer 0 (score: 1)
If you are coding in Python, use NLTK (see the NLTK book):
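import nltk

# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, run nltk.download('punkt') once (the resource name can differ between NLTK versions).
sentence = "Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. --Mr.Wikipedia"
# Split the sentence into tokens; punctuation marks come back as separate tokens.
words = nltk.word_tokenize(sentence)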
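Note that word_tokenize returns punctuation marks as tokens of their own, so the result is not quite the list asked for in the question. If you only want the words, a regular expression from the standard library may be enough. The following is a minimal sketch, not part of NLTK: the helper name split_words and the pattern are my own assumptions, and the pattern treats a dot between word characters (as in "Mr.Wikipedia") as part of the word.

```python
import re

def split_words(text):
    # Hypothetical helper (an assumption, not an NLTK function).
    # A word is a run of word characters, optionally continued by
    # ".more" so that internal dots such as "Mr.Wikipedia" are kept.
    # Commas, trailing periods, dashes and runs of spaces are skipped,
    # so no empty strings or repeated spaces appear in the result.
    return re.findall(r"\w+(?:\.\w+)*", text)

sentence = ("Text segmentation is the process of dividing written text "
            "into meaningful units, such as words, sentences, or topics. "
            "--Mr.Wikipedia")
words = split_words(sentence)
print(words)
```

Because the words are matched directly instead of replacing punctuation with spaces, the consecutive-spaces problem from the question does not arise. The trade-off is that abbreviations, hyphenated words, and contractions are handled only as far as the pattern allows, which is where a real tokenizer such as NLTK's does better.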
For Java, try the OpenNLP tokenizer.