Regex to split a string by different criteria in Python

Date: 2015-02-03 13:39:22

Tags: python regex

I want to split a string using a regular expression.

For example, given this string:

when [python] or [html ] demo  "css html"   -[javascript] score:5

I want to get the following lists from it:

contains = ['when', 'demo']
word_press = ["css html"]
tags = ['python', 'or', 'html', '-', 'javascript']
options = [{'score': 5}]
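  • Every word inside "[]" (square brackets) goes into the tags list, along with a preceding "or" or "-".
  • Words between double quotes go into the word_press list.
  • A word that contains ":" goes into the options list.
  • Anything that matches none of the criteria above goes into the contains list.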

I tried this pattern:

((?:or\s|-)?\[.*?\])|(".*?")|([a-z]+:\d*)|(\S+)

live demo

It works fine there, but this is what I get when I use it in Python:

>>> import re
>>> s = '''[python] or [html] how to "how to" user:2525
... [demo] how to createscore:5
... when [python] or [html] demo  "css html"   -[javascript] score:5'''
>>> re.findall(r'''((?:or\s|-)?\[.*?\])|(".*?")|([a-z]+:\d*)|(\S+)''', s)
[('[python]', '', '', ''),
 ('or [html]', '', '', ''),
 ('', '', '', 'how'),
 ('', '', '', 'to'),
 ('', '"how to"', '', ''),
 ('', '', 'user:2525', ''),
 ('[demo]', '', '', ''),
 ('', '', '', 'how'),
 ('', '', '', 'to'),
 ('', '', 'createscore:5', ''),
 ('', '', '', 'when'),
 ('[python]', '', '', ''),
 ('or [html]', '', '', ''),
 ('', '', '', 'demo'),
 ('', '"css html"', '', ''),
 ('-[javascript]', '', '', ''),
 ('', '', 'score:5', '')]

It returns a list of tuples. Is there a way to get one list per group instead, like this?

group1 = ['[python]', 'or [html]', '[demo]', '[python]', 'or [html]', '-[javascript]']
...

1 answer:

Answer 0 (score: 1)

>>> import re
>>> s = '''[python] or [html] how to "how to" user:2525
... [demo] how to createscore:5
... when [python] or [html] demo  "css html"   -[javascript] score:5'''

Here is one possible regular expression (with inline comments) that captures the required information (see the demo here):

>>> pattern = r'''
...     (?P<tag>                 # define group one - tags
...     (?:or\s|-)?              # - acceptable words/chars for preceding tags
...     \[.*?\])                 # - tag definition - words in square brackets
...     |(?P<word_press>".*?")   # group two - words in quotes
...     |(?P<options>[a-z]+:\d*) # group three - options with colons
...     |(?P<other>\S+)          # group four - anything left over
... '''

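Note that using this pattern with findall gives you a list of tuples, one tuple per match with one slot per group. The pattern relies on re.VERBOSE so that the whitespace and comments inside it are ignored: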

>>> re.findall(pattern, s, re.VERBOSE)
[('[python]', '', '', ''),
 ('or [html]', '', '', ''),
 ('', '', '', 'how'), 
 ('', '', '', 'to'),
 ('', '"how to"', '', ''),
 ('', '', 'user:2525', ''), 
 ('[demo]', '', '', ''),
 ('', '', '', 'how'),
 ('', '', '', 'to'), 
 ('', '', 'createscore:5', ''),
 ('', '', '', 'when'),
 ('[python]', '', '', ''), 
 ('or [html]', '', '', ''), 
 ('', '', '', 'demo'), 
 ('', '"css html"', '', ''), 
 ('-[javascript]', '', '', ''), 
 ('', '', 'score:5', '')]
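
Since the pattern uses named groups, another option is to skip the tuples entirely and bucket each match by the name of the group that matched. The following is only a minimal sketch, assuming Python 3; the compiled pattern is the same as above with shortened comments, and the variable name groups is just an illustration, not something from the question.

```python
import re
from collections import defaultdict

s = '''[python] or [html] how to "how to" user:2525
[demo] how to createscore:5
when [python] or [html] demo  "css html"   -[javascript] score:5'''

pattern = re.compile(r'''
    (?P<tag>(?:or\s|-)?\[.*?\])   # tags in square brackets, optional prefix
    |(?P<word_press>".*?")        # quoted phrases
    |(?P<options>[a-z]+:\d*)      # options of the form word:number
    |(?P<other>\S+)               # anything left over
''', re.VERBOSE)

# Map each group name to the list of strings it matched.
groups = defaultdict(list)
for match in pattern.finditer(s):
    # Exactly one alternative matches per token, so lastgroup is its name.
    groups[match.lastgroup].append(match.group())

print(groups['tag'])
print(groups['word_press'])
print(groups['options'])
print(groups['other'])
```

Here groups['tag'] holds the list the question calls group1, and the other three keys hold the remaining groups in the order they appear in the text.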

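But here is a functional-programming way to rearrange it into one tuple per group. It relies on Python 2 behaviour, where map returns a list and filter applied to a tuple returns a tuple: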

>>> from functools import partial
>>> map(partial(filter, None), zip(*re.findall(pattern, s, re.VERBOSE)))
[('[python]', 'or [html]', '[demo]', '[python]', 'or [html]', '-[javascript]'), 
 ('"how to"', '"css html"'), 
 ('user:2525', 'createscore:5', 'score:5'), 
 ('how', 'to', 'how', 'to', 'when', 'demo')]
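
On Python 3, map and filter return lazy iterators, so the one-liner above would not print the tuples directly. A rough equivalent, assuming Python 3 and the same s and pattern as above (the names rows and columns are illustrative only), could look like this:

```python
import re

s = '''[python] or [html] how to "how to" user:2525
[demo] how to createscore:5
when [python] or [html] demo  "css html"   -[javascript] score:5'''

pattern = r'''
    (?P<tag>(?:or\s|-)?\[.*?\])   # tags in square brackets, optional prefix
    |(?P<word_press>".*?")        # quoted phrases
    |(?P<options>[a-z]+:\d*)      # options of the form word:number
    |(?P<other>\S+)               # anything left over
'''

# One tuple per match, with '' in the slots of groups that did not match.
rows = re.findall(pattern, s, re.VERBOSE)

# zip(*rows) transposes the rows into one column per group;
# filter(None, ...) drops the empty strings from each column.
columns = [tuple(filter(None, column)) for column in zip(*rows)]

tag, word_press, options, other = columns
print(tag)
```

The list comprehension replaces map(partial(filter, None), ...) and forces each filtered column into a tuple, so the result has the same shape as the Python 2 output shown above.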