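How to remove all words that start with '\u...'

from collections import Counter

# `tweets`, `preprocess` and `stop` are not shown here
count_all = Counter()
for sentence in tweets[:100]:
    # tokenize the lowercased tweet and drop stop words
    cleaned_terms = [term for term in preprocess(sentence.lower()) if term not in stop]
    count_all.update(cleaned_terms)
print(count_all.most_common(5))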
Output:
#[(u'#halloween', 100), (u'\ud83d', 52), (u'\u2026', 28), (u'\ud83c', 24), (u'halloween', 14)]
Answer 0 (score: 1)
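\uXXXX is the escape for a Unicode character (for example, \u2026 is the single-character ellipsis, …). The simplest way to find such terms is to check ord(term[0]) > 255 inside your comprehension; use ord(term[0]) > 127 instead if you want to catch everything that is not ASCII. Whether that is really what you want depends on your specific use case.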
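A minimal sketch of that check, in case it helps. The helper name `starts_with_wide_char` and its `limit` parameter are made up for illustration, and the `terms` list is a stand-in for whatever `preprocess()` returns (the first five entries are copied from the output in the question, the last one is an extra accented word added to show the difference between the two thresholds).

```python
from collections import Counter


def starts_with_wide_char(term, limit=255):
    """Return True if the first character's code point is above `limit`.

    Hypothetical helper: limit=255 drops terms whose first character lies
    beyond Latin-1, limit=127 drops anything not starting with ASCII.
    """
    return bool(term) and ord(term[0]) > limit


# Stand-in for the tokens produced by preprocess(); not real tweet data.
terms = [u'#halloween', u'\ud83d', u'\u2026', u'\ud83c', u'halloween',
         u'\xe9t\xe9']

# Keep accented Latin-1 words, drop emoji halves and the ellipsis.
count_all = Counter(t for t in terms if not starts_with_wide_char(t))

# Stricter variant: keep only terms that start with an ASCII character.
count_ascii_only = Counter(
    t for t in terms if not starts_with_wide_char(t, limit=127))

print(sorted(count_ascii_only))
```

In the original loop this would amount to adding `and not starts_with_wide_char(term)` to the comprehension's condition. Note that only the first character is tested, so a term like `halloween\u2026` would still get through; if that matters, test every character with `any(...)` instead.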