Question

I'm having trouble cleaning tweets. I have a process that saves tweets to a CSV file, and I then load that data into a pandas DataFrame.

x is one tweet from my DataFrame:

'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''

More tweets: "b'RT @suzannelynch1: Meanwhile in #Washington... Almost two dozen members of #Congress write to #TheresaMay on eve of #StPatricksDay visit wa\\xe2\\x80\\xa6'

b"RT @KMTV_Kent: #KentTonight Poll:\\nKent\'s MPs will be having their say on Theresa May\'s #Brexit deal today. @SirRogerGaleMP said he\'ll back\\xe2\\x80\\xa6"

The result should look like this: James O'Brien on Geoffrey Cox's awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not for' (keep the hashtags, only remove any UTF-8 characters)

I want to clean this tweet. I tried using regular expressions with re.sub(my_regex), re.compile...

I tried several different patterns: ([\U00010000-\U0010ffff], r'@[A-Za-z0-9]+', https?://[A-Za-z0-9./]+)
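For reference, here is a minimal sketch of how those three patterns could be applied with re.sub, which takes the pattern, the replacement, and the string. The sample string is made up for illustration and is not one of the tweets above.

```python
import re

# The three patterns quoted in the question.
astral_re = re.compile('[\U00010000-\U0010ffff]')   # characters outside the BMP (most emoji)
mention_re = re.compile(r'@[A-Za-z0-9]+')           # @mentions
url_re = re.compile(r'https?://[A-Za-z0-9./]+')     # simple URLs

# Made-up sample containing a real emoji character, a mention and a URL.
sample = 'RT @LBC: hello \U0001F600 https://t.co/abc123'

cleaned = sample
for pattern in (astral_re, mention_re, url_re):
    # re.sub(pattern, repl, string): every match is replaced by ''.
    cleaned = pattern.sub('', cleaned)

print(repr(cleaned))
```

Note that the first pattern only matches real characters in that range; it cannot match an escaped sequence such as \\xe2\\x80\\xa6, which is plain ASCII text.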

I also tried this:

x.encode('ascii','ignore').decode('utf-8')

This doesn't work because of the double backslashes, whereas it does work when I do this:

'to tell us whether or not fore\xe2\x80\xa6'.encode('ascii','ignore').decode('utf-8')

It returns:

'to tell us whether or not fore'
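A small sketch of why the doubled backslashes matter: in the escaped form, each byte is spelled out as four ordinary ASCII characters (backslash, x, and two hex digits), so there is nothing non-ASCII for encode('ascii', 'ignore') to drop. The variable names here are only for illustration.

```python
# Escaped form, as stored in the DataFrame: backslash, 'x', 'e', '2', ...
escaped = 'fore\\xe2\\x80\\xa6'
# Unescaped form: three real non-ASCII characters after 'fore'.
unescaped = 'fore\xe2\x80\xa6'

print(len(escaped), escaped.isascii())      # every character is plain ASCII
print(len(unescaped), unescaped.isascii())  # the last three characters are not

# encode(..., 'ignore') only drops characters that are not ASCII.
cleaned_escaped = escaped.encode('ascii', 'ignore').decode('utf-8')
cleaned_unescaped = unescaped.encode('ascii', 'ignore').decode('utf-8')

print(repr(cleaned_escaped))    # unchanged
print(repr(cleaned_unescaped))  # 'fore'
```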

Does anyone know how to clean this? Thanks a lot!

Answer 1

See if this helps:

import re

a = 'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice:     "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''

chars = re.findall(r"""[\s"'#]+\w+""", a)

''.join([c for c in chars if c])

Output

James O'Brien on Geoffrey Cox's awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not for'
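A different approach, sketched here as an alternative and not taken from the answer above: each stored value looks like the text form of a Python bytes literal (b'...'), so ast.literal_eval can turn it back into real bytes, which can then be decoded as UTF-8 before the non-ASCII characters are dropped. The helper name clean_tweet and the step that strips the leading "RT @user:" are assumptions made to match the desired output in the question.

```python
import ast
import re


def clean_tweet(raw: str) -> str:
    """Hypothetical helper: `raw` is the text of a bytes literal, e.g. "b'RT ...'"."""
    data = ast.literal_eval(raw)                  # "b'...'" text -> real bytes object
    text = data.decode('utf-8', errors='ignore')  # \xe2\x80\xa6 becomes the ellipsis character
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop anything non-ASCII
    text = re.sub(r'^RT @\w+:\s*', '', text)      # assumption: remove the retweet prefix
    return text.strip()


x = 'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''

print(clean_tweet(x))
```

On a DataFrame this could be applied per row with something like df['text'].apply(clean_tweet), where 'text' is a placeholder column name. Values that are not valid bytes literals would make ast.literal_eval raise an exception, so real data may need a fallback.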
