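How do I clean tweets with a regular expression without removing punctuation and hashtags?

Time: 2017-04-22 08:56:05

Tags: regex twitter

I want to use tweets for sentiment analysis. I need to get rid of usernames, links, and attachments, but not punctuation or hashtags, because I extract polarity at the sentence level. I am using the following statement: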

text=' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())

But the statement above strips everything else and returns only the words.

Input:

RT @UniversalIND: #F8 is now playing in the theaters near you! So hurry and book your tickets https://www.abcabcabc.com :D ;)

Output:

RT F8 is now playing in the theaters near you So hurry and book your tickets

Required output:

RT #F8 is now playing in the theaters near you! So hurry and book your tickets

Can anyone suggest a fix?

1 answer:

Answer 0: (score: 1)

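Use the following approach:

import re

text = 'RT @UniversalIND: #F8 is now playing in the theaters near you! So hurry and book your tickets https://www.abcabcabc.com'
text = re.sub(r'@\S+|https?://\S+', '', text)

print(text)

Output (the removed substrings leave a doubled space after RT and a trailing space):

RT  #F8 is now playing in the theaters near you! So hurry and book your tickets 

@\S+|https?://\S+ matches either a substring that starts with @ and is followed by non-whitespace characters (\S+), or a link (URL) that starts with http:// or https://.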
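The approach above can be combined with the whitespace normalization already used in the question. The following is a minimal sketch, not a definitive implementation: the helper name clean_tweet and the constant MENTION_OR_URL are hypothetical, and the pattern is the one from the answer.

```python
import re

# Hypothetical names; the pattern itself is taken from the answer above.
MENTION_OR_URL = re.compile(r"@\S+|https?://\S+")


def clean_tweet(text: str) -> str:
    """Remove @mentions and http(s) links, then collapse runs of whitespace."""
    # Substitute a space so neighbouring words are never glued together.
    without_noise = MENTION_OR_URL.sub(" ", text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(without_noise.split())


tweet = ("RT @UniversalIND: #F8 is now playing in the theaters near you! "
         "So hurry and book your tickets https://www.abcabcabc.com :D ;)")
print(clean_tweet(tweet))
```

Because the character class [^0-9A-Za-z \t] from the question is no longer part of the pattern, the # and ! characters survive. Emoticons such as :D and ;) also survive, so removing them would need an additional pattern.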