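I want to run sentiment analysis on tweets. I need to get rid of usernames, links, and attachments, but not punctuation or hashtags, because I compute polarity at the sentence level. I am using the following statement: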
text=' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())
But the statement above strips everything else as well and returns only the words.
Input:
RT @UniversalIND: #F8 is now playing in the theaters near you! So hurry and book your tickets https://www.abcabcabc.com :D ;)
Output:
RT F8 is now playing in the theaters near you So hurry and book your tickets
Required output:
RT #F8 is now playing in the theaters near you! So hurry and book your tickets
Can anyone suggest a fix?
Answer 0 (score: 1)
Use the following approach:
import re
text = 'RT @UniversalIND: #F8 is now playing in the theaters near you! So hurry and book your tickets https://www.abcabcabc.com'
text = re.sub(r'@\S+|https?://\S+', '', text)
text = ' '.join(text.split())  # collapse the whitespace left behind by the removals
print(text)
Output:
RT #F8 is now playing in the theaters near you! So hurry and book your tickets
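- The pattern @\S+|https?://\S+ matches substrings that start with @ followed by one or more non-whitespace characters (\S+), or that start with http:// or https:// followed by one or more non-whitespace characters. Because \S+ stops only at whitespace, @\S+ also consumes punctuation attached to the username, such as the colon in @UniversalIND:.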
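The input in the question also ends with the emoticons :D ;), which the required output drops and which the pattern above leaves alone. Here is a minimal sketch of one way to handle the full input. The function name clean_tweet and the EMOTICON pattern are illustrative assumptions rather than part of the answer, and the emoticon pattern covers only a handful of simple forms.

```python
import re

# Mentions (plus any punctuation glued to them) and http/https URLs,
# as in the answer above.
MENTION_OR_URL = re.compile(r'@\S+|https?://\S+')

# Assumption: a deliberately small emoticon pattern (eyes, optional nose,
# mouth) that only matches when delimited by whitespace or string edges.
EMOTICON = re.compile(r'(?<!\S)[:;=]-?[)(DPp](?!\S)')


def clean_tweet(text):
    """Drop mentions, URLs and simple emoticons; keep hashtags and punctuation."""
    text = MENTION_OR_URL.sub(' ', text)
    text = EMOTICON.sub(' ', text)
    # Collapse the runs of spaces left behind by the substitutions.
    return ' '.join(text.split())


tweet = ('RT @UniversalIND: #F8 is now playing in the theaters near you! '
         'So hurry and book your tickets https://www.abcabcabc.com :D ;)')
print(clean_tweet(tweet))
```

The key difference from the statement in the question is that there is no alternative like [^0-9A-Za-z \t], which is what removed # and ! along with everything else that is not a letter, digit, or space.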