Regular expression for Twitter mentions

Time: 2017-03-29 12:36:11

Tags: python regex twitter

I'm having trouble with the "official" regular expression for parsing Twitter mentions (https://github.com/twitter/twitter-text/blob/master/java/src/com/twitter/Regex.java).

Here is my code:

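import re

# ASCII "@" and the full-width at-sign (U+FF20)
AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = "[" + AT_SIGNS_CHARS + "]"
# Groups: 1 = prefix before the mention, 2 = at-sign(s), 3 = username, 4 = optional list slug
mention_pattern = u"([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS + "]|^|(?:^|[^a-z0-9_+~.-])RT:?)(" + AT_SIGNS + "+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?"

patt = re.compile(mention_pattern)

pr = '@ciao bella'
print(patt.findall(pr))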

Why does it print:

[('', '@', 'ciao', '')]

instead of:

['@ciao']

Thanks in advance.

1 answer:

Answer 0 (score: 0)

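findall returns a tuple of groups for each match whenever the pattern contains more than one capturing group, which is why you get a tuple rather than the matched text. If you use match and print group(0), which is the entire match regardless of the individual groups, you will see the desired result: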

>>> patt.match(pr).group(0)
'@ciao'
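
To make the difference concrete, here is a minimal sketch of how findall treats capturing groups. The toy patterns and the variable names are made up for illustration and are not part of the Twitter regex:

```python
import re

text = "a1 b2"

# No capturing groups: findall returns the full matches.
no_groups = re.findall(r"[a-z]\d", text)
print(no_groups)    # ['a1', 'b2']

# One capturing group: findall returns only that group's text.
one_group = re.findall(r"([a-z])\d", text)
print(one_group)    # ['a', 'b']

# Several capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"([a-z])(\d)", text)
print(two_groups)   # [('a', '1'), ('b', '2')]

# A match object always exposes the full match as group(0).
full = re.match(r"([a-z])(\d)", text).group(0)
print(full)         # a1
```

The mention pattern has four capturing groups, so it falls into the third case.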

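If you want to use findall to find multiple results, you can wrap the whole pattern in an extra pair of parentheses, so that the full match becomes the first element of each tuple in the returned list: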

AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = "[" + AT_SIGNS_CHARS + "]"
mention_pattern = u"(([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS + "]|^|(?:^|[^a-z0-9_+~.-])RT:?)(" + AT_SIGNS + "+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?)"

patt = re.compile(mention_pattern)

pr = '@ciao bella :-), 123 @adeu guapa'

>>> print(patt.findall(pr))

[('@ciao', '', '@', 'ciao', ''), (' @adeu', ' ', '@', 'adeu', '')]
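
Note that the full match for the second mention still carries the prefix character (the leading space in ' @adeu'), because group 1 of the original pattern consumes it. If you only want the mentions themselves, one option is to iterate with finditer and join the at-sign and username groups. This is a sketch that assumes Python 3 and the original four-group pattern from the question; the name `mentions` is mine:

```python
import re

AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = u"[" + AT_SIGNS_CHARS + u"]"
# Same pattern as in the question, split over two lines for readability.
mention_pattern = (
    u"([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS + u"]|^|(?:^|[^a-z0-9_+~.-])RT:?)"
    u"(" + AT_SIGNS + u"+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?"
)
patt = re.compile(mention_pattern)

pr = u'@ciao bella :-), 123 @adeu guapa'

# Group 2 is the run of at-signs and group 3 the username; joining them
# drops the prefix character captured by group 1.
mentions = [m.group(2) + m.group(3) for m in patt.finditer(pr)]
print(mentions)  # ['@ciao', '@adeu']
```

As written, the pattern only accepts lowercase usernames; matching something like '@Ciao' would require passing re.IGNORECASE to re.compile.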

Edit: finding mentions plus Latin-alphabet words

AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = "[" + AT_SIGNS_CHARS + "]"
mention_pattern = u"([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS + "]|^|(?:^|[^a-z0-9_+~.-])RT:?)(" + AT_SIGNS + "+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?"
latin = u'[a-zA-Z]+'
mention_or_latin = "(%s|%s)" % (mention_pattern, latin)
patt = re.compile(mention_or_latin)

pr = '@ciao bella, @adeu guapa'

>>> print(patt.findall(pr))
[('@ciao', '', '@', 'ciao', ''), ('bella', '', '', '', ''), (' @adeu', ' ', '@', 'adeu', ''), ('guapa', '', '', '', '')]
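
If a flat list of tokens is more convenient than the tuples, the same combined pattern can be walked with finditer. The following is only a sketch, assuming Python 3; the `tokens` list and the group-index logic are my own addition, relying on the fact that an unmatched group is None on a match object:

```python
import re

AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = u"[" + AT_SIGNS_CHARS + u"]"
mention_pattern = (
    u"([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS + u"]|^|(?:^|[^a-z0-9_+~.-])RT:?)"
    u"(" + AT_SIGNS + u"+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?"
)
latin = u"[a-zA-Z]+"
# Group 1 is the whole token; groups 2-5 are the mention's own groups.
patt = re.compile(u"(%s|%s)" % (mention_pattern, latin))

pr = u'@ciao bella, @adeu guapa'

tokens = []
for m in patt.finditer(pr):
    if m.group(3) is not None:
        # Mention branch matched: at-sign(s) + username, without the prefix.
        tokens.append(m.group(3) + m.group(4))
    else:
        # Latin-word branch matched: the outer group is the word itself.
        tokens.append(m.group(1))
print(tokens)  # ['@ciao', 'bella', '@adeu', 'guapa']
```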