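How to remove unfamiliar words from text

Time: 2019-06-12 23:47:40

Tags: python preprocessor

I am starting a new sentiment analysis project and want to remove any unfamiliar words, stray characters, email addresses, names that begin with @, and extra whitespace. In short, I want to clean all noise out of the text.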

input text ="@maggieNYT KFC must be out chicken.  This guy itأ?آ?أ?آ?أ?آ?s losing his shit."

input text ="‰??Aye babe. Why is Pizza hut calling you at 10 PM?‰?? "

input text ="The team will be in @KingstonLibrary tomorrow from 2:30 - 5:30pm. Providingأ?آپ#HIVأ?آپ/ #STI tests &أ?آپ#freeأ?آپcondoms, along with information & advice onأ?آپ#PrEP #contraceptionأ?آپ& otherأ?آپ#sexualhealthأ?آپissues.

1 Answer:

Answer 0 (score: 0)

You can do what you are asking with regular expressions, using Python's re library. Think of a regular expression as an advanced find-and-replace facility.
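
As a minimal illustration of the find-and-replace idea, here is a sketch that removes a single Twitter-style handle with re.sub. The pattern and the variable names are chosen only for this example and are not part of the original answer.

```python
import re

text = "@maggieNYT KFC must be out chicken."
# Pattern: "@" followed by one or more word characters (a Twitter-style handle).
# re.sub replaces every match with the empty string; strip() drops the leftover leading space.
without_handle = re.sub(r"@\w+", "", text).strip()
print(without_handle)  # KFC must be out chicken.
```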

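User @Abijit provided a regular expression in this answer that performs the task in question:

"... the following regex just strips URLs (not just http), any punctuation, user names, or any non-alphanumeric characters. It also separates words with a single space. ..."

Here is my suggestion.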

     
' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())

I tested this on your sample strings, and it appears to work for your case as well. Here is my code.

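import re  # Python's regular expression library

original: str = input()
# @Abijit's regex, written as a raw string so the backslashes reach the regex engine unchanged.
# Every match is replaced with a space; split() and join() then collapse runs of whitespace.
cleaned: str = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", original).split())
print(cleaned)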
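
To make the one-line regex above easier to read, the sketch below splits it into its three alternatives and wraps the cleaning step in a function. The names HANDLE, NON_ALNUM, URL, NOISE, and clean_text are hypothetical and introduced only for illustration; the behavior is intended to match the one-liner.

```python
import re

# The three alternatives of the regex, named for readability (names are illustrative).
HANDLE = r"@[A-Za-z0-9]+"        # "@" plus ASCII letters/digits, e.g. @maggieNYT
NON_ALNUM = r"[^0-9A-Za-z \t]"   # any one character that is not an ASCII letter, digit, space, or tab
URL = r"\w+://\S+"               # scheme://rest-of-url, for any scheme

NOISE = re.compile(f"({HANDLE})|({NON_ALNUM})|({URL})")

def clean_text(text: str) -> str:
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(NOISE.sub(" ", text).split())

print(clean_text("@maggieNYT KFC must be out chicken."))
```

Because the second alternative keeps only ASCII letters and digits, this approach also strips accented letters and any non-Latin script, which may or may not be what a given dataset needs.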

The output for each input is as follows:

  • KFC must be out chicken This guy it s losing his shit
  • Aye babe Why is Pizza hut calling you at 10 PM
  • The team will be in tomorrow from 2 30 5 30pm Providing HIV STI tests amp free condoms along with information amp advice on PrEP contraception amp other sexualhealth issues
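
Two leftovers are worth noting. The "amp" tokens in the third output suggest that the raw text contained the HTML entity &amp;amp;, and the regex does not treat email addresses as a unit (an address such as joe@example.com would be reduced to "joe com"). The sketch below is one possible way to handle both, assuming that decoding entities first and using a rough email pattern is acceptable. The helper name clean_tweet and the EMAIL pattern are assumptions for illustration, not part of the original answer.

```python
import html
import re

# Assumption: a rough email matcher (non-space text around an "@"), not a full address validator.
EMAIL = re.compile(r"\S+@\S+")
# The same three alternatives as the regex in the answer above.
NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")

def clean_tweet(text: str) -> str:
    """Hypothetical helper: decode entities, drop emails, then strip the remaining noise."""
    text = html.unescape(text)   # "&amp;" becomes "&", which NOISE then removes
    text = EMAIL.sub(" ", text)  # remove whole addresses before the handle rule sees the "@"
    return " ".join(NOISE.sub(" ", text).split())

print(clean_tweet("tests &amp; advice: mail joe@example.com or ask @KingstonLibrary"))
```

The order matters in this sketch: emails are removed before the handle rule runs, otherwise only the "@example" part of an address would be stripped.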