Question

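I'm trying to use a regular expression to capture all Twitter handles in a tweet body. The challenge is that I'm trying to get handles that

contain a specific string,
are of unknown length, and
may be followed by
- punctuation
- whitespace
- or the end of the string.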

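For example, for each of these strings, the value I'd like returned is given in brackets.

"@handle what is your problem?" [RETURN '@handle']

"what is your problem @handle?" [RETURN '@handle']

"@123handle what is your problem @handle123?" [RETURN '@123handle', '@handle123']

This is what I have so far: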

>>> import re
>>> re.findall(r'(@.*handle.*?)\W','hi @123handle, hello @handle123')
['@123handle']
# This misses the handles that are followed by end-of-string

I tried modifying it to include an alternation that also allows the end of the string. Instead, it just returns the whole string.

>>> re.findall(r'(@.*handle.*?)(?=\W|$)','hi @123handle, hello @handle123')
['@123handle, hello @handle123']
# This looks like it is too greedy and ends up returning too much

How can I write an expression that satisfies both conditions?

I've looked at a couple of other places, but I'm still stuck.

Answer 1

It seems you are trying to match strings that start with @, followed by 0+ word chars, then handle, then 0+ word chars.

Use

r'@\w*handle\w*'

or, to avoid matching @ + word chars inside email addresses:

r'\B@\w*handle\w*'

See the Regex 1 demo and the Regex 2 demo (the \B non-word boundary requires a non-word char or the start of the string right before the @).
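To make the effect of \B concrete, here is a small sketch comparing the two patterns on a string that contains an email address. The sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample: one email address plus two real mentions.
text = "mail user@handle.com or ping @handle and @my_handle2"

# Without \B the "@handle" inside the email address is picked up too.
plain = re.findall(r'@\w*handle\w*', text)

# With \B the @ must not be preceded by a word char, so the email is skipped.
guarded = re.findall(r'\B@\w*handle\w*', text)

print(plain)    # ['@handle', '@handle', '@my_handle2']
print(guarded)  # ['@handle', '@my_handle2']
```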

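Note that .* is a greedy dot pattern that matches as many characters as possible, of any kind except a newline. \w* also matches 0+ characters, as many as possible, but only from the [a-zA-Z0-9_] set if the re.UNICODE flag is not used (and it is not used in your code). In Python 3, str patterns are Unicode-aware by default, so pass re.ASCII to restrict \w to that set.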
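The difference between .* and \w* can be seen on the string from the question. The sketch below assumes Python 3, where str patterns are Unicode-aware unless re.ASCII is passed; the second test string with a non-ASCII letter is invented for illustration.

```python
import re

text = 'hi @123handle, hello @handle123'

# Greedy dot: .* runs on to the last "handle", swallowing both handles at once.
greedy = re.findall(r'@.*handle.*', text)

# \w* cannot cross the comma or the spaces, so each handle is matched separately.
wordy = re.findall(r'@\w*handle\w*', text)

# "\u00e4" is the letter a-umlaut; it is a word char only in Unicode mode.
sample = '@h\u00e4ndle_handle'
unicode_mode = re.findall(r'@\w*handle\w*', sample)
ascii_mode = re.findall(r'@\w*handle\w*', sample, flags=re.ASCII)

print(greedy)             # ['@123handle, hello @handle123']
print(wordy)              # ['@123handle', '@handle123']
print(len(unicode_mode))  # 1 (the whole handle matches)
print(ascii_mode)         # []
```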

Python demo:

import re
p = re.compile(r'@\w*handle\w*')
test_str = "@handle what is your problem?\nwhat is your problem @handle?\n@123handle what is your problem @handle123?\n"
print(p.findall(test_str))
# => ['@handle', '@handle', '@123handle', '@handle123']
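If the substring to look for varies, the same pattern could be wrapped in a small function. This is only a sketch: find_handles and its keyword parameter are hypothetical names, not part of the re module, and it is checked here against the three examples from the question.

```python
import re

def find_handles(text, keyword="handle"):
    """Return every @mention in text whose name contains keyword (hypothetical helper)."""
    # re.escape keeps any regex metacharacters in keyword literal.
    pattern = re.compile(r'\B@\w*' + re.escape(keyword) + r'\w*')
    return pattern.findall(text)

print(find_handles("@handle what is your problem?"))
# ['@handle']
print(find_handles("what is your problem @handle?"))
# ['@handle']
print(find_handles("@123handle what is your problem @handle123?"))
# ['@123handle', '@handle123']
```

No lookahead for punctuation, whitespace, or end of string is needed, because \w* simply stops at the first character that is not a word char.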

Answer 2

This matches only handles made up of characters in this range: /[a-zA-Z0-9_]/.

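import re
# Note: [\w\d_] is equivalent to \w, which already covers digits and underscore.
s = "@123handle what is your problem @handle123?"
print(re.findall(r'\B(@[\w\d_]+)', s))
# => ['@123handle', '@handle123']
s = '@The quick brown fox@jumped over the LAAZY @_dog.'
print(re.findall(r'\B(@[\w\d_]+)', s))
# => ['@The', '@_dog']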

Regex to match Twitter handles in Python
