Question

I have a dataset that looks like this:

"See the new #Gucci 5th Ave NY windows customized by @troubleandrew for the debut of the #GucciGhost collection."
"Before the #GucciGhost collection debuts tomorrow, read about the artist @troubleandrew"

I am trying to get rid of every @ together with the word attached to it. My dataset should end up looking like this:

"See the new #Gucci 5th Ave NY windows customized by for the debut of the #GucciGhost collection."
"Before the #GucciGhost collection debuts tomorrow, read about the artist"

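I can use a simple replace statement to get rid of the @ itself, but the word attached to it is the problem.

I am using re to search for the occurrences, but I cannot manage to remove the word.

P.S. - If it were always the same single word, this would not be a problem, but my dataset has many different words attached to @.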

Answer 1

You can use a regular expression:

import re

a = [
    "See the new #Gucci 5th Ave NY windows customized by @troubleandrew for the debut of the #GucciGhost collection.",
    "Before the #GucciGhost collection debuts tomorrow, read about the artist @troubleandrew"
]
# "@" followed by one or more non-whitespace characters
pat = re.compile(r"@\S+")
for i in range(len(a)):
    a[i] = pat.sub("", a[i])  # replace each match with an empty string
print(a)

This will give you what you want.
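One detail worth checking with this pattern: it removes the mention but not the space in front of it, so a mention in the middle of a sentence leaves two consecutive spaces behind. A minimal sketch on a shortened sample string (the string is made up for illustration):

```python
import re

pat = re.compile(r"@\S+")

# Shortened sample; only the mention is matched, not the space before it.
text = "customized by @troubleandrew for the debut"
result = pat.sub("", text)

print(repr(result))  # repr() makes the leftover double space visible
```

If the double space matters, the pattern has to consume one of the surrounding spaces as well.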

Answer 2

A more idiomatic version that also removes the extra space:

import re

a = [
    "See the new #Gucci 5th Ave NY windows customized by @troubleandrew for the debut of the #GucciGhost collection.",
    "Before the #GucciGhost collection debuts tomorrow, read about the artist @troubleandrew"
]

# optional leading whitespace, then "@" and the word attached to it
rgx = re.compile(r"\s?@\S+")

b = [rgx.sub("", row) for row in a]

print(b)

\s?: \s matches a whitespace character such as ' ', and ? means zero or one occurrence.
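As a rough sketch of how this pattern behaves, the snippet below wraps it in a helper and runs it on two shortened sample strings. The names strip_mentions, MENTION and MENTION_WORD are hypothetical, and the \w+ variant is an assumption added here for comparison, not part of the answer: because \S+ matches any non-whitespace character, it also swallows punctuation glued to the mention, whereas \w+ stops at it.

```python
import re

MENTION = re.compile(r"\s?@\S+")       # pattern from the answer
MENTION_WORD = re.compile(r"\s?@\w+")  # assumed variant: stops at punctuation


def strip_mentions(text, pattern=MENTION):
    """Remove each @mention and at most one whitespace character before it."""
    return pattern.sub("", text)


rows = [
    "customized by @troubleandrew for the debut",
    "read about the artist @troubleandrew.",
]

cleaned = [strip_mentions(row) for row in rows]
cleaned_word = [strip_mentions(row, MENTION_WORD) for row in rows]

print(cleaned)
print(cleaned_word)
```

Which variant fits depends on the data: \S+ is safer when handles can contain characters outside \w, while \w+ keeps a trailing full stop or comma in place.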

Find a substring and remove it using regex, Python
