I am trying to capture data from input such as:
...
10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75
E' a quantidade de itens que o fornecedor consegue suprir
o cliente para uma determinada data. As casa decimais estao
definidas no campo 022 (unid. casas decimais).
11 24 DATA ENTREGA/EMBARQUE DO ITEM O N 6 0 76 81
Data de entrega/embarque do item. Nos casos em que este cam-
po nao contiver a data, seu conteudo devera ser ajustado en-
tre as partes.
...
My goal is to capture: ('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75') and so on.
My first attempt was to loop over the lines and capture the fields like this:
import re

def parse_line(line):
    pattern = r"\s(\d{1,6}|\w{1})\s"  # do not capture the description
    if re.search(pattern, line):
        tab_find = re.findall(pattern, line, re.DOTALL | re.UNICODE)
        if len(tab_find) > 6:
            return tab_find
My second attempt was to split the text and assemble the expected result:
def ugly_parsing(line):
    result = [None] * 9  # init list
    tab_r = list(filter(None, re.split(r"\s", line)))  # ignore ''
    keys = [0, 1, -1, -2, -3, -4, -5, -6]
    for i in keys:
        result[i] = tab_r[i]
    result[2] = " ".join(tab_r[2:-6])
    return result
Ignoring the description would be fine, but my regex fails when the description contains a single letter.
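To see the failure concretely, the first pattern can be run on one sample line. This is only a sketch: the single-space separators are an assumption (the real file may use wider column gaps). It shows the lone letter `A` from the description ending up among the captured fields.

```python
import re

# Pattern from the first attempt: a short number or a single word character,
# with one whitespace character on each side.
pattern = r"\s(\d{1,6}|\w{1})\s"

# Sample line; the single-space separators are an assumption for this sketch.
line = "10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75"

tokens = re.findall(pattern, line)

# The single letter "A" belongs to the description but is captured anyway.
print(tokens)
print("A" in tokens)
```

Because each match also consumes the whitespace on both sides, neighbouring fields separated by a single space can be skipped as well, which is a second reason this approach is fragile.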
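Answer 0 (score: 2)
Just turn the line into a regular expression that contains all the required numbers and characters, and let the description take whatever is left. You can do this with a non-greedy match: (.+?).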
import re

p = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$")
for line in text.splitlines():
    m = p.match(line)
    if m:
        print(m.groups())
Output:
('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75')
('11', '24', 'DATA ENTREGA/EMBARQUE DO ITEM', 'O', 'N', '6', '0', '76', '81')
Not sure whether this makes it more readable, but you can also build the large regex from smaller parts, e.g. "^" + r"(\d+)\s+" * 2 + "(.+?)" + r"\s+(\w+)" * 6 + "$"
or "^" + r"\s+".join([r"(\d+)"] * 2 + ["(.+?)"] + [r"(\w+)"] * 6) + "$"
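As a quick sanity check, the two constructions can be compared with the hand-written pattern. The sketch below uses the hypothetical names `explicit`, `by_repetition`, and `by_join`, and a sample line whose single-space separators are an assumption.

```python
import re

# The pattern written out by hand.
explicit = r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$"

# The same pattern assembled from smaller parts.
by_repetition = "^" + r"(\d+)\s+" * 2 + "(.+?)" + r"\s+(\w+)" * 6 + "$"
by_join = "^" + r"\s+".join([r"(\d+)"] * 2 + ["(.+?)"] + [r"(\w+)"] * 6) + "$"

# Sample line; the single-space separators are an assumption for this sketch.
line = "10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75"
fields = re.match(by_join, line).groups()

print(by_repetition == explicit, by_join == explicit)
print(fields)
```

Both variants produce exactly the same pattern string, so the choice between them is purely about readability.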
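Alternatively, depending on your input, you could split on something other than a single space, e.g. two or more spaces \s{2,} (as suggested in the comments) or tabs, but this can cause problems if the description contains those as well. Relying on the fixed number of fields "around" the description is probably more reliable.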
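For comparison, here is a minimal sketch of the split-based alternative. It assumes, purely for illustration, that columns are separated by at least two spaces while words inside the description are separated by exactly one; the spacing in the sample line is made up.

```python
import re

# Hypothetical spacing: columns are separated by two or more spaces,
# words inside the description by exactly one.
line = "10  79  QUANT. DE ITENS A FORNECER     O  N  9  0  67  75"

# Split on runs of two or more whitespace characters.
fields = re.split(r"\s{2,}", line.strip())
print(fields)
```

If a description ever contains a double space, it is split into two fields, which is the weakness mentioned above.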
Answer 1 (score: 1)
Given a file like this:
$ cat /tmp/test.txt
10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75
E' a quantidade de itens que o fornecedor consegue suprir
o cliente para uma determinada data. As casa decimais estao
definidas no campo 022 (unid. casas decimais).
11 24 DATA ENTREGA/EMBARQUE DO ITEM O N 6 0 76 81
Data de entrega/embarque do item. Nos casos em que este cam-
po nao contiver a data, seu conteudo devera ser ajustado en-
tre as partes.
If you want to capture the description as well, you can use mmap together with a regex and capture the file block by block.
Example:
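import re
import mmap

# In Python 3 an mmap object exposes bytes, so the block pattern is a bytes pattern.
block_pattern = re.compile(rb'^(\d+\s+\d+\s+.*?)(?=(?:^\s*$)|\Z)', flags=re.S | re.M)
data_pattern = re.compile(r'^(\d+)\s+(\d+)\s+(.*?)\s+(\w)\s+(\w)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$')

with open(fn, 'rb') as f:  # fn is the path to the input file
    txt = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for block in block_pattern.finditer(txt):
        # Decode the block (default UTF-8 assumed), then split header line from description.
        block_lines = block.group(0).decode().partition('\n')
        m = data_pattern.search(block_lines[0])
        if m:
            block_data = [m.groups(), block_lines[2]]
            print(block_data)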
Prints:
[('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75'), " \x0fE' a quantidade de itens que o fornecedor consegue suprir\n \x0fo cliente para uma determinada data. As casa decimais estao \n \x0fdefinidas no campo 022 (unid. casas decimais). \n"]
[('11', '24', 'DATA ENTREGA/EMBARQUE DO ITEM', 'O', 'N', '6', '0', '76', '81'), ' \x0fData de entrega/embarque do item. Nos casos em que este cam-\n \x0fpo nao contiver a data, seu conteudo devera ser ajustado en-\n \x0ftre as partes. \n']
As noted in the comments, this regex comes very close to what you want.
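The block-by-block idea can also be tried without mmap, on an in-memory string. The sketch below reuses the two patterns from this answer as str patterns. The sample text is hypothetical: the indentation of the description lines and the blank line between blocks are assumptions about the file layout, and `sample` and `blocks` are names made up for illustration.

```python
import re

block_pattern = re.compile(r'^(\d+\s+\d+\s+.*?)(?=(?:^\s*$)|\Z)', flags=re.S | re.M)
data_pattern = re.compile(
    r'^(\d+)\s+(\d+)\s+(.*?)\s+(\w)\s+(\w)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$'
)

# Hypothetical sample: indentation and the blank separator line are assumed.
sample = (
    "10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75\n"
    "   E' a quantidade de itens que o fornecedor consegue suprir\n"
    "\n"
    "11 24 DATA ENTREGA/EMBARQUE DO ITEM O N 6 0 76 81\n"
    "   Data de entrega/embarque do item.\n"
)

blocks = []
for block in block_pattern.finditer(sample):
    # The first line of a block is the field definition, the rest is the description.
    header, _, description = block.group(0).partition("\n")
    m = data_pattern.search(header)
    if m:
        blocks.append((m.groups(), description.strip()))

for fields, description in blocks:
    print(fields, "->", description)
```

Each block ends at the next blank line or at the end of the text, so this only works if the blocks really are separated that way.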
Answer 2 (score: 0)
Thank you all.
The goal of the script is to parse a text file, the RND definition file (http://www.anfavea.com.br/rnd/006.TXT), into a Python structure like:
recorddefs = [{'ITP': [['1', '1', 'IDENT. REGISTRATION TYPE', 'M', 'A', '3', '0', '1', '3'],
                       ['2', '33', 'IDENTIFICATION OF THE PROCESS', 'M', 'N', '3', '0', '4', '6'],
                       ...]},
              {'RP1': [['1', '1', 'IDENT. REGISTRATION TYPE', 'M', 'A', '3', '0', '1', '3'],
                       ['2', '2', 'COD. DESTINATION FACTORY', 'M', 'A', '3', '0', '4', '6'],
                       ...]},
              {'RP2': [['1', '1', 'IDENT. REGISTRATION TYPE', 'M', 'A', '3', '0', '1', '3'],
                       ['2', '24', 'DATA DELIVERY / SHIPMENT OF THE ITEM', 'M', 'N', '6', '0', '4', '9'],
                       ['3', '25', 'QT DELIVERY / SHIPMENT OF THE ITEM', 'M', 'N', '9', '0', '10', '18'],
                       ...]}]
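Each block is identified by a three-character code and contains the descriptions of all the elements that belong to it (block == segment).
Right now my code (just a snippet) looks like this: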
import contextlib
import re
from collections import defaultdict

def parse_file(filename):
    # next() raises StopIteration at end of file; suppressing it ends the generator cleanly.
    with contextlib.suppress(StopIteration):
        with open(filename) as fin:
            while True:
                line = next(fin)
                if "LAYOUT DE REGISTRO" in line:
                    yield parse_segment_block(fin)

def parse_segment_block(fin_iter):
    r = defaultdict(list)
    k = None
    while True:
        line = next(fin_iter)
        # The first three-character token found becomes the segment key.
        if re.search(r"\s(\w{3})\s", line) and not k:
            k = re.search(r"\s(\w{3})\s", line).group(1)
        tab_parser = parse_line(line)
        if tab_parser:
            r[k].append(tab_parser)
        if "Rede Nacional de Dados" in line:
            return r

def parse_line(line):
    line = line.strip()
    p = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$")
    m = p.match(line)
    if m:
        result = list(m.groups())
        result[2] = translate(result[2])  # google translate call
        return result
Given the answers above, and following @dawn's reply: is it possible to have a single global search pattern?
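One possible reading of a "global" pattern is a single multiline regex applied to the whole text instead of line by line. The sketch below is an assumption about what is wanted, not something confirmed in the thread. `SEP`, `GLOBAL_PATTERN`, and the sample text are hypothetical names and data. It uses `[ \t]+` rather than `\s+` so that a match cannot run across a line break.

```python
import re

SEP = r"[ \t]+"  # horizontal whitespace only, so a match cannot span lines

# Two numbers, a lazy description, then six word fields, anchored per line.
GLOBAL_PATTERN = re.compile(
    r"^[ \t]*"
    + SEP.join([r"(\d+)"] * 2 + [r"(.+?)"] + [r"(\w+)"] * 6)
    + r"[ \t]*$",
    re.MULTILINE,
)

# Hypothetical excerpt; the indentation of the description lines is assumed.
text = (
    "10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75\n"
    "   E' a quantidade de itens que o fornecedor consegue suprir\n"
    "   definidas no campo 022 (unid. casas decimais).\n"
    "11 24 DATA ENTREGA/EMBARQUE DO ITEM O N 6 0 76 81\n"
    "   Data de entrega/embarque do item.\n"
)

# One pass over the whole text; description lines do not start with digits
# and are therefore skipped.
rows = [m.groups() for m in GLOBAL_PATTERN.finditer(text)]
for row in rows:
    print(row)
```

A description line that happens to start with two numbers and end with six short tokens would also match, so the per-segment markers used in parse_segment_block may still be needed.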