Question

I have a file containing millions of retweets, like this:

RT @Username: Text_of_the_tweet

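I only need to extract the username from this string. Since I am a complete zero when it comes to regular expressions, I was advised here a while ago to use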

username = re.findall('@([^:]+)', retweet)

This works well in most cases, but sometimes I get lines like this:

RT @ReutersAero: Further pictures from the #MH17 crash site in  in Grabovo, #Ukraine #MH17 - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…

I only need "ReutersAero" from this string, but because the line contains another "@" and another ":", the regex matches twice and I get this output:

['ReutersAero', 'reuterspictures (GRAPHIC)']

Is there a way to apply the regex only to the first instance it finds in the string?

Answer 1

You can use a regex like this:

RT @(\w+):


Match information:

MATCH 1
1.  [4-15]  `ReutersAero`
MATCH 2
1.  [145-156]   `AnotherAero`

You can use this Python code:

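import re
# Requiring the literal "RT @" prefix means only the retweeted user is captured.
# (The Python 2 ur'' prefix is a syntax error in Python 3; a plain raw string works.)
p = re.compile(r'RT @(\w+):')
test_str = "RT @ReutersAero: Further pictures from the #MH17 crash site in  in Grabovo, #Ukraine #MH17 - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…\nRT @AnotherAero: Further pictures from the #MH17 crash site in  in Grabovo, #Ukraine #MH17 - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…\n"

# One result per retweet line: ['ReutersAero', 'AnotherAero']
print(p.findall(test_str))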
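
Since the question mentions a file with millions of retweets, the same pattern could also be applied one line at a time. The following is only a rough sketch, assuming one retweet per line and that each retweet starts with "RT @"; the helper name extract_usernames and the sample lines are made up for illustration.

```python
import re

# Hypothetical helper; assumes one retweet per line, each starting with "RT @".
RT_PATTERN = re.compile(r'RT @(\w+):')

def extract_usernames(lines):
    """Yield the retweeted username from each line that looks like a retweet."""
    for line in lines:
        m = RT_PATTERN.match(line)  # match() only tries at the start of the line
        if m:
            yield m.group(1)

# Illustrative sample data, not taken from a real file.
sample = [
    "RT @ReutersAero: Further pictures - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…",
    "not a retweet",
    "RT @AnotherAero: Text_of_the_tweet",
]
print(list(extract_usernames(sample)))
```

Because the helper accepts any iterable of strings, an open file object could be passed in directly, so the whole file never has to be loaded into memory.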

Answer 2

Is there a way to apply the regex only to the first instance it finds in the string?

Don't use findall; use search instead. It stops at the first match.
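
As a minimal sketch of that suggestion, here is the pattern from the question applied with re.search to the problematic line. The variable names are only illustrative; note that re.search returns a match object or None, so the result is checked before the group is read.

```python
import re

retweet = ("RT @ReutersAero: Further pictures from the #MH17 crash site in  in Grabovo, "
           "#Ukraine #MH17 - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…")

# re.search returns the first match only (or None if nothing matches).
m = re.search(r'@([^:]+)', retweet)
username = m.group(1) if m else None
print(username)  # ReutersAero
```

Unlike findall, which returns a list of every match, search yields a single match object, so the second "@...:" in the tweet is never considered.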
