Question

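I am new to Python and to programming in general. I have a text file that contains URLs, @mentions, # tags and so on, which I want to remove so that I get clean text data to feed into a machine learning algorithm. For example, given the following text data: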

@Su2ieQ13 But you're IMing with meeeeee. 
"@apogeum whoooaa, thats soo awesome  my eyes look like black.. except if you have a yellow light bulb close to my eyes then u can"
The shop of the day  http://
"i couldn't sleep so i stayed awake watching @lilbsuremusic on this live stream thingy and now i'm taking my butt to bed, so sweet dreams "
@Lee_Knight ok haha thanks i will try that lol

I wrote the following code:

import re
import string

# load text negative
filename_neg = '/path/to/my/text_file'
file = open(filename_neg, encoding="ISO-8859-1")
text_neg = file.read()
text_neg = re.sub(r'^https?:\/\/.*[\r\n]*', '', text_neg,flags=re.MULTILINE)
file.close()
# split into words by white space
words_neg = text_neg.split()
print(words_neg)

However, it still does not remove the URLs and the other items. I would appreciate it if someone could help me solve this. Thanks.

Answer 1

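text_neg = re.sub(r'@|http://|"', '', text_neg, flags=re.MULTILINE)

Separate the symbols you want to remove with |. This strips only the matched characters themselves (@, http:// and "), not the rest of the mention or URL.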
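
As a small sketch of how the | alternation behaves, here it is applied to a string assembled from the question's sample data. The variable names sample and cleaned are only for illustration.

```python
import re

# Sample text assembled from the lines shown in the question.
sample = '"@apogeum whoooaa" The shop of the day  http://'

# Each alternative separated by | is tried at every position;
# whatever matches is replaced by the empty string.
# Only the symbols are removed: the name after @ stays in the text.
cleaned = re.sub(r'@|http://|"', '', sample)

print(cleaned.strip())  # apogeum whoooaa The shop of the day
```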

Answer 2

For your problem, you could try the following:

text_neg = re.sub(r'(http://|https://)\S*', '', text_neg)
text_neg = re.sub(r'@\S*', '', text_neg)
text_neg = re.sub(r'#\S*', '', text_neg)

Let me know if that helps!
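
For context, the pattern in the question starts with ^, so with re.MULTILINE it can only match a URL at the very beginning of a line, while the URL in the sample data appears mid-line. The sketch below folds the three substitutions into one helper and runs it on lines from the question. The name clean_tweet and the whitespace cleanup are assumptions added for illustration, not part of the original answer.

```python
import re

# Hypothetical helper; the name clean_tweet is not from the original answers.
def clean_tweet(text):
    # Drop URLs, @mentions and #tags, each up to the next whitespace.
    text = re.sub(r'https?://\S*|[@#]\S*', '', text)
    # Collapse the extra whitespace left behind by the removals.
    return ' '.join(text.split())

samples = [
    "@Su2ieQ13 But you're IMing with meeeeee. ",
    "The shop of the day  http://",
    "@Lee_Knight ok haha thanks i will try that lol",
]

for line in samples:
    print(clean_tweet(line))
```

Note that [@#]\S* removes everything from the @ or # up to the next space, so it would also cut the tail off an e-mail address. If that matters for your data, the pattern would need tightening.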

Removing URLs / @ etc. in Python
