Python: remove words that start with '\u...'

Date: 2015-11-02 01:38:16

Tags: python regex

How can I remove all words that start with '\u...'?
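from collections import Counter

# tweets, preprocess and stop are not shown in the question
count_all = Counter()
for sentence in tweets[:100]:
    # keep only the terms that are not in the stop list
    cleaned_terms = [term for term in preprocess(sentence.lower()) if term not in stop]
    count_all.update(cleaned_terms)

print count_all.most_common(5)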

Output:

#[(u'#halloween', 100), (u'\ud83d', 52), (u'\u2026', 28), (u'\ud83c', 24), (u'halloween', 14)]

1 answer:

Answer 0 (score: 1)

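\uXXXX corresponds to a Unicode character (for example, 2026 is the single-character ellipsis, …). The simplest option for finding these terms is to check whether ord(term[0]) > 255 in your comprehension. Note that strictly non-ASCII means ord(term[0]) > 127; 255 is the point above which Python 2's repr switches from \xXX to \uXXXX escapes. Whether this is actually what you want depends on your particular use case.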
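As a minimal sketch of that check, the snippet below filters a small hand-written list of terms before counting them. The helper name starts_with_high_codepoint, its threshold parameter, and the sample terms are assumptions made for illustration; they are not part of the question's code, and the question's preprocess and stop are left out.

```python
from collections import Counter


def starts_with_high_codepoint(term, threshold=255):
    """Return True if the term's first character has a code point above threshold.

    Hypothetical helper: threshold=255 matches what Python 2 shows as \\uXXXX,
    while threshold=127 would catch every non-ASCII first character.
    """
    return bool(term) and ord(term[0]) > threshold


# Sample terms modelled on the output in the question (illustrative only).
terms = [u'#halloween', u'\ud83d', u'\u2026', u'halloween', u'\xe9t\xe9']

# Drop terms whose first character is above the threshold, then count the rest.
kept = [t for t in terms if not starts_with_high_codepoint(t)]
counts = Counter(kept)

# Stricter variant: drop anything that starts with a non-ASCII character.
ascii_only = [t for t in terms if not starts_with_high_codepoint(t, threshold=127)]
```

With the default threshold, u'\xe9t\xe9' is kept because U+00E9 is below 256; with threshold=127 it is dropped as well. In the question's loop the same condition could be added to the comprehension next to the stop-word test.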