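I am starting a new sentiment analysis project and want to clean the noise out of my text: unfamiliar words and characters, email addresses, any name containing @, and extra whitespace.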
input text ="@maggieNYT KFC must be out chicken. This guy itأ?آ?أ?آ?أ?آ?s losing his shit."
or
input text ="‰??Aye babe. Why is Pizza hut calling you at 10 PM?‰?? "
or
input text ="The team will be in @KingstonLibrary tomorrow from 2:30 - 5:30pm. Providingأ?آپ#HIVأ?آپ/ #STI tests &أ?آپ#freeأ?آپcondoms, along with information & advice onأ?آپ#PrEP #contraceptionأ?آپ& otherأ?آپ#sexualhealthأ?آپissues.
Answer 0 (score: 0)
You can do what you are asking with regular expressions, using Python's re library. You can think of a regular expression as an advanced find-and-replace.
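As a minimal illustration of the find-and-replace idea, the sketch below uses re.sub to replace every @-mention with a space and then collapses the leftover whitespace. The sample string and variable names are made up for this example and do not come from the question.

```python
import re

# Hypothetical sample text, not taken from the question
text = "thanks @someone for the tip"

# Replace "@" followed by ASCII letters/digits with a single space
replaced = re.sub(r"@[A-Za-z0-9]+", " ", text)

# Collapse the runs of whitespace left behind by the replacement
result = " ".join(replaced.split())
print(result)
```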
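User @Abijit provided a regular expression in this answer that performs the task in question.
...the following regex strips URLs (not just http), any punctuation, user names, and any non-alphanumeric characters. It also separates the words with a single space....
Here is what I would suggest.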
' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
I tested this on your example strings, and it appears to work for your case as well. Here is my code.
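import re  # Python regex library

original: str = input()  # read one line of raw text from stdin
# The following line uses @Abijit's regex. The raw string (r"...") keeps
# backslashes such as \w and \S from being treated as string escapes.
# Every match is replaced by a space; split() and join() then collapse
# repeated whitespace into single spaces.
cleaned: str = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", original).split())
print(cleaned)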
The output for each input is as follows:
KFC must be out chicken This guy it s losing his shit
Aye babe Why is Pizza hut calling you at 10 PM
The team will be in tomorrow from 2 30 5 30pm Providing HIV STI tests amp free condoms along with information amp advice on PrEP contraception amp other sexualhealth issues
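To make it easier to see which part of the pattern removes what, here is a sketch of the same regex written out with re.VERBOSE and wrapped in a function. The names NOISE and clean_text are my own for illustration and are not part of @Abijit's answer; the sample strings are shortened ASCII versions of the inputs above plus a made-up URL example.

```python
import re

# Same three alternatives as the one-liner, spelled out with re.VERBOSE.
# Whitespace inside a character class stays literal in verbose mode.
NOISE = re.compile(
    r"""
    (@[A-Za-z0-9]+)        # @-mentions such as @KingstonLibrary
    | ([^0-9A-Za-z \t])    # anything that is not an ASCII letter, digit, space or tab
    | (\w+://\S+)          # URLs with any scheme, not only http
    """,
    re.VERBOSE,
)


def clean_text(text: str) -> str:
    """Replace every noise match with a space, then collapse whitespace."""
    return " ".join(NOISE.sub(" ", text).split())


print(clean_text("@maggieNYT KFC must be out chicken."))
print(clean_text("Why is Pizza hut calling you at 10 PM?"))
print(clean_text("Read https://example.com/a?b=1 now!"))
```

Note that the pattern keeps only ASCII letters and digits, so accented letters and non-Latin text are removed as well, and hashtags lose their # but keep the word. That may or may not be what you want for sentiment analysis.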