I'm having trouble with the "official" regular expression for parsing Twitter mentions (https://github.com/twitter/twitter-text/blob/master/java/src/com/twitter/Regex.java).
Here is my code:
import re
AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = "[" + AT_SIGNS_CHARS + "]"
mention_pattern = u"([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS + "]|^|(?:^|[^a-z0-9_+~.-])RT:?)(" + AT_SIGNS + "+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?"
patt = re.compile(mention_pattern)
pr = '@ciao bella'
print patt.findall(pr)
Why does it print:
[('', '@', 'ciao', '')]
instead of:
['@ciao']
Thanks in advance.
Answer 0: (score: 0)
If you use match and print group(0), which is the full match regardless of the individual groups, you will see the desired result:
>>> patt.match(pr).group(0)
'@ciao'
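If you want to use findall to find multiple results, you can add an extra pair of parentheses around the whole pattern, so that the full match is the first element of each tuple in the returned list: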
AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = "[" + AT_SIGNS_CHARS + "]"
mention_pattern = u"(([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS + "]|^|(?:^|[^a-z0-9_+~.-])RT:?)(" + AT_SIGNS + "+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?)"
patt = re.compile(mention_pattern)
pr = '@ciao bella :-), 123 @adeu guapa'
>>> print patt.findall(pr)
[('@ciao', '', '@', 'ciao', ''), (' @adeu', ' ', '@', 'adeu', '')]
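As background to the above: when a pattern contains capturing groups, findall returns one tuple of groups per match rather than the matched text. An alternative that leaves the pattern untouched is finditer, which yields match objects whose group(0) is the full match. The following is only a sketch of that idea; the names group_tuples and full_matches are mine, and print is written with parentheses so the snippet runs on both Python 2 and 3.

```python
import re

AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = u"[" + AT_SIGNS_CHARS + u"]"
mention_pattern = (u"([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS
                   + u"]|^|(?:^|[^a-z0-9_+~.-])RT:?)("
                   + AT_SIGNS + u"+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?")
patt = re.compile(mention_pattern)

pr = u'@ciao bella :-), 123 @adeu guapa'

# The pattern has four capturing groups, so findall returns 4-tuples.
group_tuples = patt.findall(pr)

# finditer yields match objects; group(0) is the whole matched text.
full_matches = [m.group(0) for m in patt.finditer(pr)]
print(full_matches)
```

Note that the second full match still carries the leading space, because the first group of the pattern consumes the character that precedes the at sign.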
Edit: finding mentions plus Latin words
AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = "[" + AT_SIGNS_CHARS + "]"
mention_pattern = u"([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS + "]|^|(?:^|[^a-z0-9_+~.-])RT:?)(" + AT_SIGNS + "+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?"
latin = u'[a-zA-Z]+'
mention_or_latin = "(%s|%s)" % (mention_pattern, latin)
patt = re.compile(mention_or_latin)
pr = '@ciao bella, @adeu guapa'
>>> print patt.findall(pr)
[('@ciao', '', '@', 'ciao', ''), ('bella', '', '', '', ''), (' @adeu', ' ', '@', 'adeu', ''), ('guapa', '', '', '', '')]
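If only the clean tokens are wanted, without the empty groups or the leading character that the mention pattern captures before the at sign, the tuples can be post-processed. This is a minimal sketch under that assumption; clean_tokens is a hypothetical helper, not part of twitter-text, and it rebuilds each mention from the at-sign, name, and optional list groups.

```python
import re

AT_SIGNS_CHARS = u"@\uFF20"
AT_SIGNS = u"[" + AT_SIGNS_CHARS + u"]"
mention_pattern = (u"([^a-z0-9_!#$%&*" + AT_SIGNS_CHARS
                   + u"]|^|(?:^|[^a-z0-9_+~.-])RT:?)("
                   + AT_SIGNS + u"+)([a-z0-9_]{1,20})(/[a-z][a-z0-9_\\-]{0,24})?")
latin = u"[a-zA-Z]+"
patt = re.compile(u"(%s|%s)" % (mention_pattern, latin))


def clean_tokens(text):
    """Return mentions and Latin words without the preceding context character."""
    tokens = []
    for full, before, at, name, list_slug in patt.findall(text):
        if at:
            # Mention branch: rebuild it from its own groups, dropping `before`.
            tokens.append(at + name + list_slug)
        else:
            # Latin branch: only the outer group is populated.
            tokens.append(full)
    return tokens


tokens = clean_tokens(u'@ciao bella, @adeu guapa')
print(tokens)
```

Rebuilding the mention from groups, rather than calling strip on the full match, also drops a preceding "RT" or punctuation character when one was captured.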