Question

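I'm doing some work with Twitter, and a lot of the tweets look like this:

measles \xd2@theblackpenseur: gonorrhea rt @kylegotjokes: aids rt \xd2@cache___: my head itching so bad ?\xd3

I think the \xd2 bits are emojis (though I may be wrong, and would welcome a correction).

How can I remove these from the string while leaving the rest of the string intact?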

Answer 1

Depending on how thoroughly you want to clean the data, you could use

>>> import string
>>> tweet = 'measles \xd2@theblackpenseur: gonorrhea rt @kylegotjokes: aids rt \xd2@cache___: my head itching so bad ?\xd3'
>>> # Keep only printable ASCII characters; ''.join() is needed on Python 3, where filter() returns an iterator
>>> ''.join(filter(lambda x: x in string.printable, tweet))
'measles @theblackpenseur: gonorrhea rt @kylegotjokes: aids rt @cache___: my head itching so bad ?'
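
If you need this in more than one place, the same idea can be wrapped in a small function. This is only a minimal sketch: the name strip_non_printable is made up for illustration, and it assumes that throwing away every character outside string.printable is acceptable for your data. That includes legitimate non-ASCII text such as accented letters, not just emojis.

```python
import string

# Build the lookup set once; membership tests on a set are faster
# than scanning the string.printable string for every character.
PRINTABLE = set(string.printable)


def strip_non_printable(text):
    """Return text with every character outside string.printable removed.

    Hypothetical helper, not part of any library. Note that it also
    drops accented letters and any other non-ASCII characters.
    """
    return ''.join(ch for ch in text if ch in PRINTABLE)


tweet = ('measles \xd2@theblackpenseur: gonorrhea rt @kylegotjokes: '
         'aids rt \xd2@cache___: my head itching so bad ?\xd3')
cleaned = strip_non_printable(tweet)
print(cleaned)
```

If you want to keep non-English text and only drop emojis, this filter is probably too aggressive and you would need something more targeted.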

Answer 2

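This may sound a bit like self-promotion (all the more so given how old this question is), but I have a Python library that can do this, among other things. The library is cucco, and basically you can do: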

from cucco import Cucco
cucco = Cucco()
cucco.remove_stop_words('Your text')

Removing Emojis from a String
