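Problem cleaning tweets (emoji, emoticons...)

Time: 2019-03-19 12:57:01

Tags: python regex unicode tweets emoticons

I am having trouble cleaning tweets. I have a process that saves tweets to a CSV file, and I then load the data into a pandas DataFrame.

x is one tweet from my DataFrame: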

'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''

More tweets: "b'RT @suzannelynch1: Meanwhile in #Washington... Almost two dozen members of #Congress write to #TheresaMay on eve of #StPatricksDay visit wa\\xe2\\x80\\xa6'

b"RT @KMTV_Kent: #KentTonight Poll:\\nKent\'s MPs will be having their say on Theresa May\'s #Brexit deal today. @SirRogerGaleMP said he\'ll back\\xe2\\x80\\xa6"

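The result should look like this: James O'Brien on Geoffrey Cox's awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not for' (keep the hashtags and remove only the non-ASCII UTF-8 characters)

I want to clean this tweet. I tried using regular expressions with re.sub(my_regex), re.compile, ...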

I tried several regular expressions: [\U00010000-\U0010ffff], r'@[A-Za-z0-9]+', https?://[A-Za-z0-9./]+
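For reference, here is a minimal sketch of how those three patterns could be applied with re.sub. It assumes the text has already been decoded into a normal string. The names EMOJI, MENTION, URL and strip_noise are made up for illustration, and the underscore in the mention pattern is an addition (handles such as @KMTV_Kent contain one).

```python
import re

# Patterns from the question, with the stray spaces removed. The underscore in
# the mention pattern is an addition, since handles such as @KMTV_Kent use it.
EMOJI = re.compile('[\U00010000-\U0010ffff]')
MENTION = re.compile(r'@[A-Za-z0-9_]+')
URL = re.compile(r'https?://[A-Za-z0-9./]+')

def strip_noise(text):
    """Remove astral-plane characters, @mentions and URLs from decoded text."""
    for pattern in (EMOJI, MENTION, URL):
        text = pattern.sub('', text)
    # Collapse the gaps left behind by the removals.
    return re.sub(r'\s+', ' ', text).strip()

sample = 'RT @KMTV_Kent: vote \U0001F600 today https://t.co/abc123'
print(strip_noise(sample))  # RT : vote today
```

These patterns only match real characters, so they do nothing to escape text such as \\xe2\\x80\\xa6 that is stored as literal backslashes.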

I also tried this:

x.encode('ascii','ignore').decode('utf-8')  

That does not work because of the double backslashes, but it does work when I do this:

'to tell us whether or not fore\xe2\x80\xa6'.encode('ascii','ignore').decode('utf-8')

It returns:

'to tell us whether or not fore'
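The difference between the two cases can be sketched as follows. With doubled backslashes the string holds the literal characters backslash, x, e, 2 and so on, all of which are ASCII, so encode('ascii', 'ignore') has nothing to drop. The helper name ascii_only is made up for this illustration.

```python
escaped = 'fore\\xe2\\x80\\xa6'  # what the DataFrame holds: literal backslashes
real = 'fore\xe2\x80\xa6'        # three actual non-ASCII characters

def ascii_only(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode('ascii', 'ignore').decode('ascii')

print(len(escaped), len(real))         # 16 7
print(ascii_only(escaped) == escaped)  # True: every character is ASCII
print(ascii_only(real))                # fore
```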

Does anyone know how to clean this? Thank you very much!

1 answer:

Answer 0 (score: 1)

See if this helps:

import re

a = 'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice:     "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''

# Keep each word together with the whitespace, quote or # that precedes it
chars = re.findall(r"""[\s"'#]+\w+""", a)

''.join([c for c in chars if c])

Output:

James O'Brien on Geoffrey Cox's awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not for'
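A different approach, not part of the answer above, is to undo the escaping instead of matching around it. This sketch assumes each cell holds the str() of a bytes object (for example because the encoded tweet was written to the CSV as-is). The function name clean_tweet is made up for illustration. It uses ast.literal_eval to rebuild the bytes, decodes them as UTF-8, and then drops whatever is not ASCII.

```python
import ast
import re

def clean_tweet(raw):
    """Turn the str() of a bytes tweet back into plain ASCII text."""
    try:
        data = ast.literal_eval(raw)   # "b'...'" text -> real bytes
    except (ValueError, SyntaxError):
        return raw                     # not a valid literal; leave untouched
    if not isinstance(data, bytes):
        return raw
    # errors='ignore' also copes with a multi-byte character cut in half.
    text = data.decode('utf-8', errors='ignore')
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop non-ASCII
    text = re.sub(r'^RT @\w+:\s*', '', text)               # drop retweet prefix
    return text.strip()

x = 'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''
print(clean_tweet(x))
```

Hashtags are kept because nothing in the function touches them. On a DataFrame this could be applied with something like df['text'].map(clean_tweet), where 'text' is a hypothetical column name.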