Find a substring and remove it using regex, Python

Time: 2016-09-15 08:59:12

Tags: regex python-2.7

I have a dataset that looks like this:

"See the new #Gucci 5th Ave NY windows customized by @troubleandrew for the debut of the #GucciGhost collection."
"Before the #GucciGhost collection debuts tomorrow, read about the artist @troubleandrew"

I am trying to get rid of every @ and the word attached to it. My dataset should end up looking like this:

"See the new #Gucci 5th Ave NY windows customized by for the debut of the #GucciGhost collection."
"Before the #GucciGhost collection debuts tomorrow, read about the artist"

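I can get rid of the @ itself with a simple replace statement, but the adjacent word is the problem.

I am using re to search for/find the occurrences, but I cannot remove the word.

P.S. If it were always the same single word, this would not be a problem. But my dataset has many different words attached to @.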

2 answers:

Answer 0: (score: 2)

You can use a regular expression:

import re

a = [ 
"See the new #Gucci 5th Ave NY windows customized by @troubleandrew for the debut of the #GucciGhost collection.",
"Before the #GucciGhost collection debuts tomorrow, read about the artist @troubleandrew"
]
pat = re.compile(r"@\S+")  # "@" followed by one or more non-whitespace characters
for i in range(len(a)):
    a[i] = pat.sub("", a[i])  # replace each match with an empty string
print(a)

This removes every @ together with the word attached to it.
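One thing worth checking is the whitespace around each removed mention. The sketch below runs the same pattern on a shortened sample sentence and then collapses runs of whitespace; the collapsing step is an assumption added for illustration, not part of the answer above.

```python
import re

text = "customized by @troubleandrew for the debut"

# Same pattern as above: "@" plus the non-whitespace run that follows it.
stripped = re.sub(r"@\S+", "", text)
# The spaces on both sides of the mention survive, so two spaces remain in a row.
print(repr(stripped))

# One possible cleanup (illustrative assumption): split on whitespace and rejoin,
# which collapses repeated spaces and trims the ends.
normalized = " ".join(stripped.split())
print(repr(normalized))
```

Note that the split-and-join step also normalizes any other repeated whitespace in the text, which may or may not be wanted.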

Answer 1: (score: 0)

An idiomatic version that also removes the extra space:

import re

a = [
    "See the new #Gucci 5th Ave NY windows customized by @troubleandrew for the debut of the #GucciGhost collection.",
    "Before the #GucciGhost collection debuts tomorrow, read about the artist @troubleandrew"
]

rgx = re.compile(r"\s?@\S+")

b = [rgx.sub("", row) for row in a]  # strip every mention from each row

print(b)

In \s?, the \s matches a whitespace character such as ' ', and the ? means zero or one occurrence.
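A possible refinement, sketched here under the assumption that the words after @ are handles made only of letters, digits and underscores: \S+ consumes everything up to the next whitespace, including punctuation stuck to the handle, whereas \w+ stops at the first non-word character. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample: the mention is followed directly by a comma.
sample = "Read about the artist @troubleandrew, then see the windows."

# \S+ runs to the next whitespace, so the comma is removed with the mention.
greedy = re.sub(r"\s?@\S+", "", sample)
print(greedy)

# \w+ matches only letters, digits and underscores, so the comma is kept.
wordish = re.sub(r"\s?@\w+", "", sample)
print(wordish)
```

Which variant fits depends on the data: \w+ preserves punctuation, but it would cut a handle short if handles could contain other characters.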