I want to match group numbers and the lines that belong to each group in text like this:
domain 1
654789 text (one or more lines)
domain 2
125478 text (one or more lines)
I want to get:
domain 1 654789
domain 2 125478
My code is:
import re
from re import match
domain = re.compile(r'[-+]?domain')
terminal = re.compile(r'^[0-9][0-9]{6}(?!\d)')
with open('in_texto.txt') as file_in:
    for linea in file_in:
        for match in re.finditer(domain, linea):
            dom = re.findall('\d+', linea)[0]
            print(dom)
            for lineas in file_in:
                for match in re.finditer(terminal, lineas):
                    print(dom+" "+lineas, end='')
But it only prints:
654789 text
956478 text
125478 text
.....
How can I fix this?
Answer 0 (score: 1)
Here is a solution using the regex module (it works exactly the same way with re):
import regex  # third-party; with the stdlib re, substitute re.split for regex.split, etc.
string = 'domain 1 \ntotal.....\n======= \n\n654789 text \n956478 text\ndomain 2\n======= \ncolumn..... \n\n\n125478 text \n456987 text '
# Split on the domain headers; element 0 is whatever precedes the first header.
domains = regex.split(r'domain \d+', string)
out = []
for k in range(1, len(domains)):
    # k doubles as the domain number, which assumes domains are numbered 1, 2, 3, ... in order.
    out.extend('domain {} {}'.format(k, d) for d in regex.findall(r'\d+(?=\s*text)', domains[k]))
print(out)
# ['domain 1 654789', 'domain 1 956478', 'domain 2 125478', 'domain 2 456987']
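The pattern \d+(?=\s*text) picks out the numbers: it matches a run of digits only when it is followed by optional whitespace and then "text".
Answer 1 (score: 0)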
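One approach is to extract the domains first: find each domain line together with all the text up to the next domain line. Then split that text into lines and keep only the lines that start with a 6-digit number: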
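import re
# A terminal line starts with exactly six digits followed by whitespace.
terminal = re.compile(r'(\d{6})\s+')
with open('in_texto.txt') as file_in:
    # Each match is a domain header plus all the text up to the next header (or end of input).
    for domain, lines in re.findall(r'^(domain\s+\d+)(.*?)(?=^domain|\Z)', file_in.read(), re.M | re.S):
        for line in lines.splitlines():
            t = terminal.match(line)
            if t:
                print(domain, t.group(1))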
This will give you something like:
domain 1 654789
domain 1 956478
domain 2 125478
domain 2 456987
domain 2 236512
domain 3 369852
domain 3 548723
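A single pass over the lines is another option, and it avoids reading the whole file into one string. The sketch below is not taken from either answer. It assumes that header lines look like "domain N" and that terminal lines start with exactly six digits, both inferred from the sample input. The function name pair_domains is hypothetical.

```python
import re

# Assumed formats: a header line "domain <number>" and data lines that
# start with exactly six digits (both inferred from the sample input).
DOMAIN = re.compile(r'^domain\s+(\d+)')
TERMINAL = re.compile(r'^(\d{6})(?!\d)')


def pair_domains(lines):
    """Yield 'domain N XXXXXX' for each six-digit line under a domain header."""
    current = None  # number of the most recent domain header seen
    for line in lines:
        header = DOMAIN.match(line)
        if header:
            current = header.group(1)
            continue
        terminal = TERMINAL.match(line)
        if terminal and current is not None:
            yield 'domain {} {}'.format(current, terminal.group(1))


sample = """domain 1
total.....
=======

654789 text
956478 text
domain 2
=======
column.....

125478 text
456987 text
"""

for row in pair_domains(sample.splitlines()):
    print(row)
```

For a file, the open file object can be passed directly, as in pair_domains(file_in), because it is iterated line by line. Remembering the current domain in a variable replaces the second loop over file_in in the question's code, which appears to consume the rest of the file as soon as the first domain line is found.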