Question

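I'm a Python beginner. I have carried out all the NLP steps to clean my text. As the last step I stemmed my string and then removed all its stop words. The problem is that after stemming, there is a letter 'u' in front of every stemmed word. I want to remove this letter from the whole text.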

This is the input:

['other', 'program', u'crash', 'win', 'restart', 'or', 'reboot', 'then', u'bookmark', 'and', u'person', u'set', u'delet', 'happen', 'to', 'me', 'three', 'time', u'mani', u'program', 'are', u'open', 'and', 'one', u'crash', 'i', 'reboot', 'the', 'system', 'when', 'the', 'system', u'restart', 'mozilla', 's', u'bookmark', 'and', u'person', u'set', u'prefer', u'save', u'login', u'pass', u'vanish', 'the', u'prefer', 'are', 'set', 'back', 'to', 'the', 'default', u'seem', 'to', 'me', 'that', 'mozilla', u'ha', 'been', u'interupt', u'write', u'save', u'it', 'current', u'set', 'when', 'i', u'restart', u'thu', u'eras', 'it']

Required output:

['other', 'program', 'crash', 'win', 'restart', 'or', 'reboot', 'then', 'bookmark', 'and', 'person', 'set', 'delet', 'happen', 'to', 'me', 'three', 'time', 'mani', 'program', 'are', 'open', 'and', 'one', 'crash', 'i', 'reboot', 'the', 'system', 'when', 'the', 'system', 'restart', 'mozilla', 's', 'bookmark', 'and', 'person', 'set', 'prefer', 'save', 'login', 'pass', 'vanish', 'the', 'prefer', 'are', 'set', 'back', 'to', 'the', 'default', 'seem', 'to', 'me', 'that', 'mozilla', 'ha', 'been', 'interupt', 'write', 'save', 'it', 'current', 'set', 'when', 'i', 'restart', 'thu', 'eras', 'it']

This is my code:

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in token]
print stemmed

Answer 1

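The u prefix is how Python 2 displays unicode strings; it is part of the printed representation, not of the string's contents. Encode them all to ASCII and the output will look the way you want: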

stemmed_normalized = [element.encode("ascii") for element in stemmed]
print stemmed_normalized

Or you can do it in one step:

[porter.stem(word).encode("ascii") for word in token]
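To illustrate why the prefix is harmless, here is a minimal sketch. The sample list is hypothetical and only stands in for the stemmed tokens from the question; the code is written so that it also runs on Python 3, where every str is unicode and the prefix is no longer displayed. It also shows one caveat of encode("ascii"): it raises UnicodeEncodeError as soon as a token contains a non-ASCII character.

```python
# Hypothetical sample standing in for the `stemmed` list from the question.
stemmed = [u'crash', u'bookmark', u'person']

# The u prefix is not stored in the string: the text compares equal
# to a plain literal with the same characters.
same_text = (stemmed[0] == 'crash')

# Joining the items yields plain text without prefixes or quotes.
joined = u' '.join(stemmed)

# Caveat: encode("ascii") fails on text that is not pure ASCII.
try:
    u'caf\xe9'.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(same_text)
print(joined)
print(encode_failed)
```

If the tokens are only going to be compared, counted, or written out as text, leaving them as unicode is usually fine, since the prefix shows up only when the list itself is printed.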
