Regex to extract multi-line hash comments

Time: 2015-05-06 20:18:39

Tags: python regex

I currently have writer's block trying to come up with an elegant solution to this problem.

Take the following example:

{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}

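From the above, I would like to extract the code comments as groups rather than individually. Lines are grouped when one commented line directly follows another. A comment will always start with whitespace followed by #.

Example result: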

Capture group 1: Some information about field 1\n on multiple lines
Capture group 2: Some more info on a single line

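I could iterate over the lines and evaluate them in code, but it would be nicer to use a regex if possible. If you feel that regex is not the right solution to this problem, please explain why.

Summary:

Thank you all for submitting various solutions to this problem; this is a prime example of how helpful the SO community can be. I will spend an hour of my own time answering other questions to make up for the collective time spent on this.

Hopefully this thread will help others in the future as well.

4 answers: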

Answer 0: (score: 2)

You can use re.findall with the following regex:

>>> m = re.findall(r'\s*#(.*)\s*#(.*)|#(.*)[^#]*', s, re.MULTILINE)
>>> m
[(' Some information about field 1', ' on multiple lines', ''), ('', '', ' Some more info on a single line')]

To print the groups, you can do:

>>> for i,j in enumerate(m):
...   print ('group {}:{}'.format(i," & ".join([i for i in j if i])))
... 
group 0: Some information about field 1 &  on multiple lines
group 1: Some more info on a single line

But as a more general way of grouping the comments, you can use itertools.groupby:

s="""{
  "data": {
    # Some information about field 1
    # on multiple lines
    # threeeeeeeeecomment
    "field1": "XXXXXXXXXX"

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}"""
from itertools import groupby

def is_comment(line):
    return line.strip().startswith('#')

# groupby yields runs of consecutive lines that share the same key;
# keep only the runs made of comment lines
comments = [list(run) for flag, run in groupby(s.split('\n'), is_comment) if flag]

for i, block in enumerate(comments, 1):
    cleaned = [t.strip(' #') for t in block]
    print('group {} :{}'.format(i, ' & '.join(cleaned)))

Result:

group 1 :Some information about field 1 & on multiple lines & threeeeeeeeecomment
group 2 :Some more info on a single line

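Answer 1: (score: 1)

Let's suppose, for example, that you want to use a single regular expression to grab some specific data (e.g. hashtags) from each line of a multi-line string: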

#!/usr/bin/env python
# coding: utf-8

import re

# the regexp isn't 100% accurate, but you'll get the point
# groups followed by '?' match if repeated 0 or 1 times.
regexp = re.compile(r'^.*(#[a-z]*).*(#[a-z]*)?$')

multiline_string = '''
                     The awesomeness of #MotoGP is legendary. #Bikes rock!
                     Awesome racing car #HeroComesHome epic
'''

iterable_list = multiline_string.splitlines()

for line in iterable_list:
    fragments = regexp.match(line)
    if fragments is None:
        # Lines without a '#' (such as the blank first line) do not match.
        continue

    frag_in_str = fragments.group(1)

    # Keep in mind: an optional group that did not take part in the match
    # returns None; IndexError is raised only if the group number does not
    # exist in the pattern, which a try/except block can guard against.
    try:
        some_other_subpattern = fragments.group(2)
    except IndexError:
        some_other_subpattern = ''

    entire_match = fragments.group(0)

Every parenthesized group can be extracted this way.
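One detail of group access in Python's re module is worth spelling out: an optional group that did not take part in the match returns None from group(), while IndexError is raised only for a group number the pattern does not define. A small sketch of the difference (the pattern and sample strings here are made up for illustration and are not from the answer):

```python
import re

# Hypothetical pattern for illustration: group 2 is optional, there is no group 3.
pattern = re.compile(r'(#\w+)(?:\s+(#\w+))?')

m = pattern.search('racing car #HeroComesHome epic')
first = m.group(1)    # the hashtag that matched
second = m.group(2)   # None: the group exists but did not take part in the match

try:
    m.group(3)
    missing = False
except IndexError:
    missing = True    # raised because the pattern defines only two groups

# search() returns None when nothing matches, so calling .group() on the
# result would raise AttributeError rather than IndexError.
no_match = pattern.search('no hashtags here')
print(first, second, missing, no_match)
```

In practice this means checking the match object for None is usually the guard that matters, and the try/except is only needed when the group number itself may be out of range.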

A good example of a negated pattern is posted here: How to negate specific word in regex?

Answer 2: (score: 1)

You can use a deque to hold two lines at a time and add some logic to partition the comments into blocks:

src='''\
{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",


    # multiple line comments
    # supported
    # as well 
    "field3": "#this would be ignored"

  }
}
'''

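from collections import deque

d = deque([], 2)   # sliding window: the previous line and the current line
blocks = []
block = []
for line in src.splitlines():
    d.append(line.strip())
    if d[-1].startswith('#'):
        comment = line.partition('#')[2]
        if len(d) == 2 and d[0].startswith('#'):
            block.append(comment)   # previous line was a comment too: same block
        else:
            block = [comment]       # start a new block
    elif d[0].startswith('#'):
        blocks.append(block)        # a comment block has just ended

for i, b in enumerate(blocks):
    print('block {}: \n{}'.format(i, '\n'.join(b)))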

Prints:

block 0: 
 Some information about field 1
 on multiple lines
block 1: 
 Some more info on a single line
block 2: 
 multiple line comments
 supported
 as well 
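One edge case in the loop above: a block is only stored when a non-comment line follows it, so a comment block at the very end of the input would be dropped. A minimal sketch of a variant that flushes the last block, assuming the same "stripped line starts with #" rule (the function name comment_blocks and the sample text are made up for illustration):

```python
def comment_blocks(text):
    """Yield lists of comment strings, one list per run of consecutive '#' lines."""
    block = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('#'):
            # Drop the leading '#' and surrounding whitespace.
            block.append(stripped[1:].strip())
        elif block:
            # A non-comment line ends the current block.
            yield block
            block = []
    if block:
        # Flush a block that runs to the end of the input.
        yield block


sample = '''\
{
    # first block
    # continues here
    "a": 1,
    "b": "#not a comment"
    # trailing block
'''
result = list(comment_blocks(sample))
print(result)
```

A plain list replaces the deque here because only the "am I inside a block" state is needed, not the previous line itself.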

Answer 3: (score: 1)

It is not possible to do this purely with a regex, but you can get away with a one-liner :)

import re

# named "text" rather than "str" to avoid shadowing the built-in
text = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX"
    # Some information about field 1
    # on multiple lines
    # Some information about field 1
    # on multiple lines
    "field3": "#this would be ignored"
  }
}"""

# raw string so that \s reaches the regex engine without an escape warning
rex = re.compile(r"(^(?!\s*#.*?[\r\n]+)(.*?)([\r\n]+|$)|[\r\n]*^\s*#\s*)+", re.MULTILINE)
print(rex.sub("\n", text).strip().split('\n\n'))

Output:

['Some information about field 1\non multiple lines', 'Some more info on a single line', 'Some information about field 1\non multiple lines\nSome information about field 1\non multiple lines']
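As an alternative sketch (not from the answer above), a regex can match each run of consecutive comment lines as a whole, leaving only the # prefixes to be stripped in a second step. This assumes a comment line is one whose first non-blank character is #; the names block_re and prefix_re are made up for illustration:

```python
import re

text = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}"""

# One match per run of consecutive lines whose first non-blank character is '#'.
block_re = re.compile(r'(?:^[ \t]*#.*(?:\n|$))+', re.MULTILINE)
# Removes the indentation, the '#', and one optional space from each line in a block.
prefix_re = re.compile(r'^[ \t]*#[ \t]?', re.MULTILINE)

groups = [prefix_re.sub('', block).rstrip('\n') for block in block_re.findall(text)]
print(groups)
```

Because the pattern contains no capturing groups, re.findall returns the full text of each block, and the "#this would be ignored" value is skipped since its # is not at the start of a line.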