Regex to capture different kinds of patterns

Time: 2015-09-07 15:20:56

Tags: python regex

I am trying to capture data from input such as:

...
10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75
           E' a quantidade  de  itens  que o fornecedor consegue suprir
           o cliente para uma determinada data. As casa decimais estao 
           definidas no campo 022 (unid. casas decimais).              

11   24    DATA ENTREGA/EMBARQUE DO ITEM    O    N     6    0   76  81
           Data de entrega/embarque do item. Nos casos em que este cam-
           po nao contiver a data, seu conteudo devera ser ajustado en-
           tre as partes. 
...

My goal is to capture: ('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75') and so on for each entry.

My first attempt was to loop over the lines and capture like this:

import re

def parse_line(line):
    pattern = r"\s(\d{1,6}|\w{1})\s" # do not capture the description
    if re.search(pattern, line):
        tab_find = re.findall(pattern, line, re.DOTALL|re.UNICODE)
        # only lines with enough columns are data lines
        if len(tab_find) > 6:
            return tab_find

My second attempt was to split the text and assemble the expected result:

def ugly_parsing(line):
    result = [None] * 9 # init list
    tab_r = list(filter(None, re.split(r"\s", line))) # ignore '' 
    keys = [0, 1, -1, -2, -3, -4, -5, -6]
    for i in keys:
        result[i] = tab_r[i]
    result[2] = " ".join(tab_r[2:-6])
    return result

Ignoring the description is fine, but my regex fails when the description contains a single letter.
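To illustrate the failure, here is a minimal sketch that runs the pattern from the first attempt on the first sample line. The variable names are only for illustration. The single-letter word "A" in the description is picked up as if it were a column, and the first and last columns are skipped because they have no whitespace on both sides.

```python
import re

# Pattern from the first attempt: up to six digits, or one word character,
# with a whitespace character on each side.
pattern = r"\s(\d{1,6}|\w{1})\s"

line = "10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75"
found = re.findall(pattern, line)

# 'A' comes from the description, not from a data column.
print(found)
print("A" in found, "10" in found, "75" in found)
```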

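3 Answers:

Answer 0: (score: 2)

Simply turn the line into a regular expression that contains all the required digit and character fields and leaves whatever remains to the description. You can do this with a non-greedy match: (.+?)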

import re

p = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$")
for line in text.splitlines():  # text holds the input shown in the question
    m = p.match(line)
    if m:
        print(m.groups())

Output:

('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75')
('11', '24', 'DATA ENTREGA/EMBARQUE DO ITEM', 'O', 'N', '6', '0', '76', '81')

Not sure whether this makes it any more readable, but you can also build the large regex from smaller parts, e.g. "^" + r"(\d+)\s+" * 2 + "(.+?)" + r"\s+(\w+)" * 6 + "$" or "^" + r"\s+".join([r"(\d+)"] * 2 + ["(.+?)"] + [r"(\w+)"] * 6) + "$"
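As a quick check, the sketch below builds the pattern both ways and compares the results with the hand-written version. The variable names are made up for this example.

```python
import re

# The hand-written pattern from the answer.
full = r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$"

# Built by repeating the sub-patterns.
by_repetition = "^" + r"(\d+)\s+" * 2 + "(.+?)" + r"\s+(\w+)" * 6 + "$"

# Built by joining the nine groups with the separator.
by_join = "^" + r"\s+".join([r"(\d+)"] * 2 + ["(.+?)"] + [r"(\w+)"] * 6) + "$"

print(by_repetition == full, by_join == full)

line = "11   24    DATA ENTREGA/EMBARQUE DO ITEM    O    N     6    0   76  81"
groups = re.match(by_join, line).groups()
print(groups)
```

Both constructions produce exactly the same pattern string, so choosing between them is purely a matter of readability.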

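Alternatively, depending on your input, you could split on something other than a single space, e.g. two or more spaces \s{2,} (as suggested in the comments) or tabs, but this can cause problems if the description itself contains such a separator. Relying on the fixed number of fields "around" the description is probably more reliable.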
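A minimal sketch of the splitting approach, assuming every column in a data line is separated by at least two spaces, as in the sample input. The helper name split_columns and the sanity check on the field count are assumptions added for illustration.

```python
import re

def split_columns(line):
    """Split a line on runs of two or more whitespace characters.

    Returns the nine fields, or None when the line does not look like a
    data line (wrong field count or non-numeric leading columns).
    """
    fields = re.split(r"\s{2,}", line.strip())
    if len(fields) == 9 and fields[0].isdigit() and fields[1].isdigit():
        return fields
    return None

header = "10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75"
description = "           E' a quantidade  de  itens  que o fornecedor consegue suprir"

print(split_columns(header))
print(split_columns(description))
```

The description line contains double spaces of its own, which is exactly the weakness mentioned above; here the field-count check is what rejects it.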

Answer 1: (score: 1)

Given a file like this:

$ cat /tmp/test.txt
10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75
           E' a quantidade  de  itens  que o fornecedor consegue suprir
           o cliente para uma determinada data. As casa decimais estao 
           definidas no campo 022 (unid. casas decimais).              

11   24    DATA ENTREGA/EMBARQUE DO ITEM    O    N     6    0   76  81
           Data de entrega/embarque do item. Nos casos em que este cam-
           po nao contiver a data, seu conteudo devera ser ajustado en-
           tre as partes. 

If you also want to capture the long description, you can use mmap together with a regex and capture the file block by block.

Example:

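import re
import mmap

fn = '/tmp/test.txt'
# mmap exposes the file as bytes, so the block pattern is a bytes pattern.
block_pattern = re.compile(br'^(\d+\s+\d+\s+.*?)(?=(?:^\s*$)|\Z)', flags=re.S | re.M)
data_pattern = re.compile(r'^(\d+)\s+(\d+)\s+(.*?)\s+(\w)\s+(\w)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$')
with open(fn, 'rb') as f:
    txt = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for block in block_pattern.finditer(txt):
        # Decoding as latin-1 is an assumption about the file's encoding.
        block_lines = block.group(0).decode('latin-1').partition('\n')
        m = data_pattern.search(block_lines[0])
        if m:
            # First line holds the columns; the rest is the long description.
            block_data = [m.groups(), block_lines[2]]
            print(block_data)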

Prints:

[('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75'), "           \x0fE' a quantidade  de  itens  que o fornecedor consegue suprir\n           \x0fo cliente para uma determinada data. As casa decimais estao \n           \x0fdefinidas no campo 022 (unid. casas decimais).              \n"]
[('11', '24', 'DATA ENTREGA/EMBARQUE DO ITEM', 'O', 'N', '6', '0', '76', '81'), '           \x0fData de entrega/embarque do item. Nos casos em que este cam-\n           \x0fpo nao contiver a data, seu conteudo devera ser ajustado en-\n           \x0ftre as partes. \n']

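As noted in the comments, this regex is very close to what you want.

Answer 2: (score: 0)

Thanks to all of you.

The goal of the script is to parse the RND definition file at http://www.anfavea.com.br/rnd/006.TXT into a Python structure like this: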

    recorddefs = [{'ITP': [['1', '1', 'IDENT. REGISTRATION TYPE', 'M', 'A', '3', '0', '1', '3'],
                           ['2', '33', 'IDENTIFICATION OF THE PROCESS', 'M', 'N', '3', '0', '4', '6'],
                           ...]},
                  {'RP1': [['1', '1', 'IDENT. REGISTRATION TYPE', 'M', 'A', '3', '0', '1', '3'],
                           ['2', '2', 'COD. DESTINATION FACTORY', 'M', 'A', '3', '0', '4', '6'],
                           ...]},
                  {'RP2': [['1', '1', 'IDENT. REGISTRATION TYPE', 'M', 'A', '3', '0', '1', '3'],
                           ['2', '24', 'DATA DELIVERY / SHIPMENT OF THE ITEM', 'M', 'N', '6', '0', '4', '9'],
                           ['3', '25', 'QT DELIVERY / SHIPMENT OF THE ITEM', 'M', 'N', '9', '0', '10', '18'],
                           ...]}]

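Each block is identified by a three-character code and contains the descriptions of all the elements that belong to it (block == segment).

My code (snippets only) currently looks like this: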

import contextlib
import re
from collections import defaultdict

# Compiled once instead of on every call.
SEGMENT_CODE = re.compile(r"\s(\w{3})\s")
LINE_PATTERN = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$")


def parse_file(filename):
    # Reaching the end of the file raises StopIteration, which ends the scan quietly.
    with contextlib.suppress(StopIteration):
        with open(filename) as fin:
            while True:
                line = next(fin)
                if "LAYOUT DE REGISTRO" in line:
                    yield parse_segment_block(fin)


def parse_segment_block(fin_iter):
    r = defaultdict(list)
    k = None  # segment code, taken from the first line that contains one
    while True:
        line = next(fin_iter)
        if k is None:
            code = SEGMENT_CODE.search(line)
            if code:
                k = code.group(1)
        tab_parser = parse_line(line)
        if tab_parser:
            r[k].append(tab_parser)
        if "Rede Nacional de Dados" in line:
            return r


def parse_line(line):
    m = LINE_PATTERN.match(line.strip())
    if m:
        result = list(m.groups())
        result[2] = translate(result[2]) # google translate call (helper not shown here)
        return result

That takes the answers above into account. Following @dawn's reply, would it be possible to have a single global search pattern?
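One possible direction, sketched here rather than taken from any of the answers: compile the line pattern with re.MULTILINE and run finditer over the whole text. This sketch assumes the columns are separated only by spaces or tabs, so [ \t]+ replaces \s+ to keep a match from running across line breaks. ROW_RE and the sample text are illustrative names and data.

```python
import re

# One pattern applied to the whole text; ^ and $ anchor at each line.
ROW_RE = re.compile(
    r"^(\d+)[ \t]+(\d+)[ \t]+(.+?)" + r"[ \t]+(\w+)" * 6 + r"[ \t]*$",
    flags=re.MULTILINE,
)

text = (
    "10   79    QUANT. DE ITENS A FORNECER       O    N     9    0   67  75\n"
    "           E' a quantidade  de  itens  que o fornecedor consegue suprir\n"
    "\n"
    "11   24    DATA ENTREGA/EMBARQUE DO ITEM    O    N     6    0   76  81\n"
    "           Data de entrega/embarque do item.\n"
)

# Description lines start with spaces, so only the data lines match.
rows = [m.groups() for m in ROW_RE.finditer(text)]
for row in rows:
    print(row)
```

With this shape there is no need to loop over lines by hand, although splitting the file into segments would still have to happen separately.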