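I'm trying to find an efficient way to parse files that contain fixed-width lines. For example, the first 20 characters represent one column, characters 21 to 30 another, and so on.
Assuming a line holds 100 characters, what would be an efficient way to parse it into its components?
I could use string slicing on each line, but that gets a bit ugly when the line is long. Are there any other fast methods?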
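Answer 0 (score: 63)
I'm not sure whether this is efficient, but it should be readable (as opposed to slicing by hand). I defined a function slices that takes a string and the column lengths and returns the substrings. I made it a generator, so for very long lines it doesn't build a temporary list of substrings.
def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length
Examples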
In [32]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2))
Out[32]: ['ab']
In [33]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50))
Out[33]: ['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']
In [51]: d,c,h = slices('dogcathouse', 3, 3, 5)
In [52]: d,c,h
Out[52]: ('dog', 'cat', 'house')
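However, I think the generator's advantage is lost if you need all the columns at once. Where it pays off is when you want to process the columns one at a time, for example in a loop.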
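A minimal sketch of that column-by-column case, assuming a made-up three-column layout (widths 3, 3, 5) and made-up sample rows. The slices function is repeated from above so the block runs on its own, and io.StringIO stands in for a real file.

```python
import io

def slices(s, *args):
    """Yield consecutive substrings of s with the given lengths."""
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length

# Hypothetical layout and sample data, for illustration only.
WIDTHS = (3, 3, 5)
sample_file = io.StringIO("dogcathouse\ncowpigbarn  \n")

rows = []
for line in sample_file:
    # Each column is produced lazily as the comprehension consumes the generator.
    row = [column.strip() for column in slices(line.rstrip("\n"), *WIDTHS)]
    rows.append(row)

print(rows)
```

Because each column is consumed as soon as it is yielded, no intermediate list of raw substrings is built per line.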
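Answer 1 (score: 58)
Using the Python standard library's struct module would be both fairly easy and fast, since it's written in C.
Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.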
import struct
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))
Output:
fmtstring: '2s 10x 24s', recsize: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
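The snippet above passes a str to struct, which only works in Python 2. The following modification adapts it to work in either Python 2 or 3 (and to handle Unicode input):
import sys
fieldstruct = struct.Struct(fmtstring)
if sys.version_info[0] < 3:
    parse = fieldstruct.unpack_from
else:
    # converts unicode input to byte string and results back to unicode string
    unpack = fieldstruct.unpack_from
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. The nice thing about it, besides not being all that ugly, is that it works unchanged in both Python 2 and 3 and can handle Unicode strings. I haven't benchmarked it, but suspect it may be competitive with the struct module version speed-wise. It could be sped up slightly by removing the ability to have padding fields.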
try:
    from itertools import izip_longest  # added in Py 2.6
except ImportError:
    from itertools import zip_longest as izip_longest  # name change in Py 3.x
try:
    from itertools import accumulate  # added in Py 3.2
except ImportError:
    def accumulate(iterable):
        'Return running totals (simplified version).'
        total = next(iterable)
        yield total
        for value in iterable:
            total += value
            yield total
def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # optional informational function attributes
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))
Output:
format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
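Since neither version was benchmarked above, here is one rough way to compare them with timeit. This is only a sketch assuming Python 3; parse_struct and parse_slices are hypothetical names for simplified versions of the two approaches, and timings will differ from machine to machine, so no figures are claimed.

```python
import struct
import timeit
from itertools import accumulate

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'

# struct-based parser (Python 3: encode the input, decode the results)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
unpack = struct.Struct(fmtstring).unpack_from

def parse_struct(text):
    return tuple(s.decode() for s in unpack(text.encode()))

# slice-based parser: precompute (start, end) pairs for the non-padding fields
cuts = tuple(accumulate(abs(fw) for fw in fieldwidths))
bounds = tuple((i, j) for fw, i, j in zip(fieldwidths, (0,) + cuts, cuts)
               if fw >= 0)

def parse_slices(text):
    return tuple(text[i:j] for i, j in bounds)

# Both parsers should agree before their speed is compared.
assert parse_struct(line) == parse_slices(line)

for func in (parse_struct, parse_slices):
    seconds = timeit.timeit(lambda: func(line), number=20000)
    print('{}: {:.4f} s per 20,000 calls'.format(func.__name__, seconds))
```

The lambda wrapper adds the same small overhead to both measurements, so the comparison stays fair even though the absolute numbers are slightly inflated.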
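Answer 2 (score: 19)
Two options that are easier and prettier than the solutions already mentioned:
The first is to use pandas:
import pandas as pd
path = 'filename.txt'
# Using pandas with a column specification.
# Each colspec is a half-open interval [start, end), so adjacent columns share a boundary.
col_specification = [(0, 20), (20, 30), (30, 50), (50, 100)]
data = pd.read_fwf(path, colspecs=col_specification)
The second option uses numpy.loadtxt:
import numpy as np
# Using NumPy and letting it figure it out automagically.
# loadtxt splits on whitespace by default and expects numeric data,
# so this only works when the columns are whitespace-separated numbers.
data_also = np.loadtxt(path)
It really depends on how you want to use your data.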
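Answer 3 (score: 11)
The code below gives a sketch of what you might want to do if you have some serious fixed-column-width file handling to do.
"Serious" = multiple record types in each of multiple file types, records up to 1000 bytes, the layout definer and the "opposing" producer/consumer is a government department with attitude, layout changes result in unused columns, up to a million records in a file, ...
Features: precompiles the struct formats. Ignores unwanted columns. Converts input strings to the required data types (the sketch omits error handling). Converts records to object instances (or dicts, or named tuples, if you prefer).
Code: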
import struct, datetime, io, pprint
# functions for converting input fields to usable data
cnv_text = str.rstrip
cnv_int = int
cnv_date_dmy = lambda s: datetime.datetime.strptime(s, "%d%m%Y")  # ddmmyyyy
# etc
# field specs (field name, start pos (1-relative), len, converter func)
fieldspecs = [
    ('surname', 11, 20, cnv_text),
    ('given_names', 31, 20, cnv_text),
    ('birth_date', 51, 8, cnv_date_dmy),
    ('start_date', 71, 8, cnv_date_dmy),
]
fieldspecs.sort(key=lambda x: x[1])  # just in case
# build the format for struct.unpack
unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
print(unpack_len, unpack_fmt)
unpacker = struct.Struct(unpack_fmt).unpack_from
class Record(object):
    pass
# or use named tuples
# struct works on bytes in Python 3, so the sample data is a bytes literal
raw_data = b"""\
....v....1....v....2....v....3....v....4....v....5....v....6....v....7....v....8
          Featherstonehaugh   Algernon Marmaduke  31121969            01012005XX
"""
f = io.BytesIO(raw_data)
headings = next(f)
for line in f:
    # The guts of this loop would of course be hidden away in a function/method
    # and could be made less ugly
    raw_fields = unpacker(line)
    r = Record()
    for x in field_indices:
        # decode each raw bytes field before handing it to its converter
        setattr(r, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x].decode()))
    pprint.pprint(r.__dict__)
    print("Customer name:", r.given_names, r.surname)
Output:
78 10x20s20s8s12x8s
{'birth_date': datetime.datetime(1969, 12, 31, 0, 0),
'given_names': 'Algernon Marmaduke',
'start_date': datetime.datetime(2005, 1, 1, 0, 0),
'surname': 'Featherstonehaugh'}
Customer name: Algernon Marmaduke Featherstonehaugh
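The "or use named tuples" remark could be sketched roughly as follows, assuming Python 3 and the same record layout. Person and parse_record are hypothetical names introduced only for this illustration, and error handling is omitted as in the original.

```python
import datetime
import struct
from collections import namedtuple

def cnv_date_dmy(s):
    return datetime.datetime.strptime(s, "%d%m%Y")  # ddmmyyyy

# (field name, start pos (1-relative), len, converter func), sorted by start pos
fieldspecs = [
    ('surname', 11, 20, str.rstrip),
    ('given_names', 31, 20, str.rstrip),
    ('birth_date', 51, 8, cnv_date_dmy),
    ('start_date', 71, 8, cnv_date_dmy),
]

# Build the struct format, padding over the gaps between wanted fields.
fmt, pos = "", 0
for _name, start, length, _conv in fieldspecs:
    if start - 1 > pos:
        fmt += "{}x".format(start - 1 - pos)
    fmt += "{}s".format(length)
    pos = start - 1 + length
unpack = struct.Struct(fmt).unpack_from

# Hypothetical record type; field names come straight from the specs.
Person = namedtuple('Person', [spec[0] for spec in fieldspecs])

def parse_record(line_bytes):
    """Unpack one record and convert each raw field with its converter."""
    raw = unpack(line_bytes)
    return Person(*(spec[3](value.decode('ascii'))
                    for spec, value in zip(fieldspecs, raw)))

# Sample record built with ljust so the column padding is explicit.
line = (b" " * 10 + b"Featherstonehaugh".ljust(20)
        + b"Algernon Marmaduke".ljust(20)
        + b"31121969" + b" " * 12 + b"01012005XX")
person = parse_record(line)
print(person.given_names, person.surname, person.birth_date.date())
```

A named tuple keeps the records immutable and gives attribute access without a custom class, at the cost of fixing the field set when the type is created.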
Answer 4 (score: 4)
> str = '1234567890'
> w = [0,2,5,7,10]
> [ str[ w[i-1] : w[i] ] for i in range(1,len(w)) ]
['12', '345', '67', '890']
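The list w above holds cumulative offsets rather than widths. If only the field widths are known, the offsets can be derived with itertools.accumulate (Python 3.2+). A small sketch, with text and widths as illustrative names (text is used instead of str to avoid shadowing the built-in):

```python
from itertools import accumulate

text = '1234567890'
widths = [2, 3, 2, 3]              # field widths, for illustration

# Running totals of the widths give the end offset of each field.
ends = list(accumulate(widths))    # [2, 5, 7, 10]
starts = [0] + ends[:-1]           # [0, 2, 5, 7]

fields = [text[start:end] for start, end in zip(starts, ends)]
print(fields)  # ['12', '345', '67', '890']
```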
Answer 5 (score: 0)
Here's a simple module for Python 3, based on John Machin's answer; adapt as needed :)
"""
fixedwidth

Parse and iterate through a fixedwidth text file, returning record objects.

Adapted from https://stackoverflow.com/a/4916375/243392

USAGE

    import fixedwidth, pprint

    # define the fixed width fields we want
    # fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define the fieldtype conversion functions
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': int,
    }

    # iterate over record objects in the file
    with open(path, 'rb') as f:
        for record in fixedwidth.reader(f, fieldspecs, fieldtype_fns):
            pprint.pprint(record.__dict__)

    # output:
    {'FILEID': 'SF1ST', 'LOGRECNO': 2, 'POP100': 1, 'STUSAB': 'TX', 'SUMLEV': '040'}
    {'FILEID': 'SF1ST', 'LOGRECNO': 3, 'POP100': 2, 'STUSAB': 'TX', 'SUMLEV': '040'}
    ...
"""
import struct, io
# fieldspec columns
iName, iDescription, iStart, iWidth, iType = range(5)
def get_struct_unpacker(fieldspecs):
    """
    Build the struct unpacker for the given fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    The format string it builds looks like "6s2s3s7x7s4x9s".
    """
    unpack_len = 0
    unpack_fmt = ""
    for fieldspec in fieldspecs:
        start = fieldspec[iStart] - 1
        end = start + fieldspec[iWidth]
        if start > unpack_len:
            unpack_fmt += str(start - unpack_len) + "x"
        unpack_fmt += str(end - start) + "s"
        unpack_len = end
    struct_unpacker = struct.Struct(unpack_fmt).unpack_from
    return struct_unpacker
class Record(object):
    pass
# or use named tuples
def reader(f, fieldspecs, fieldtype_fns):
    """
    Wrap a fixedwidth file and return records according to the given fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldtype_fns is a dictionary of functions used to transform the raw string values,
    one for each type.
    """
    # make sure fieldspecs are sorted properly
    fieldspecs.sort(key=lambda fieldspec: fieldspec[iStart])
    struct_unpacker = get_struct_unpacker(fieldspecs)
    field_indices = range(len(fieldspecs))
    for line in f:
        raw_fields = struct_unpacker(line)  # split line into field values
        record = Record()
        for i in field_indices:
            fieldspec = fieldspecs[i]
            fieldname = fieldspec[iName]
            s = raw_fields[i].decode()  # convert raw bytes to a string
            fn = fieldtype_fns[fieldspec[iType]]  # get conversion function
            value = fn(s)  # convert string to value (eg to an int)
            setattr(record, fieldname, value)
        yield record
if __name__=='__main__':
    # test module
    import pprint, io
    # define the fields we want
    # fieldspecs are [name, description, start, width, type]
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]
    # define a conversion function for integers
    def to_int(s):
        """
        Convert a numeric string to an integer.
        Allows a leading ! as an indicator of missing or uncertain data.
        Returns None if no data.
        """
        try:
            return int(s)
        except ValueError:
            try:
                return int(s[1:])  # ignore a leading !
            except ValueError:
                return None  # assume has a leading ! and no value
    # define the conversion fns
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': to_int,
        # 'N': int,
        # 'D': lambda s: datetime.datetime.strptime(s, "%d%m%Y"),  # ddmmyyyy
        # etc
    }
    # define a fixedwidth sample; each record must be at least 38 characters long,
    # so the column padding (including trailing spaces) is written out explicitly
    sample = (
        "SF1ST TX04089000  00000023748        1\n"
        "SF1ST TX04090000  00000033748!       2\n"
        "SF1ST TX04091000  00000043748!        \n"
    )
    sample_data = sample.encode()  # convert string to bytes
    file_like = io.BytesIO(sample_data)  # create a file-like wrapper around bytes
    # iterate over record objects in the file
    for record in reader(file_like, fieldspecs, fieldtype_fns):
        # print(record)
        pprint.pprint(record.__dict__)
Answer 6 (score: 0)
Here is what NumPy uses under the hood (much simplified, but still; this code is found in the LineSplitter class within the _iotools module):
import numpy as np
DELIMITER = (20, 10, 10, 20, 10, 10, 20)
idx = np.cumsum([0] + list(DELIMITER))
slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
def parse(line):
    return [line[s] for s in slices]
It does not handle negative delimiters for ignoring columns, so it is not as versatile as struct, but it is faster.
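Answer 7 (score: 0)
String slicing doesn't have to be ugly as long as you keep it organized. Consider storing your field widths in a dictionary and then using the associated names to create an object: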
from collections import OrderedDict
class Entry:
    def __init__(self, line):
        name2width = OrderedDict()
        name2width['foo'] = 2
        name2width['bar'] = 3
        name2width['baz'] = 2
        pos = 0
        for name, width in name2width.items():
            val = line[pos : pos + width]
            if len(val) != width:
                raise ValueError("not enough characters: \'{}\'".format(line))
            setattr(self, name, val)
            pos += width
file = "ab789yz\ncd987wx\nef555uv"
entry = []
for line in file.split('\n'):
    entry.append(Entry(line))
print(entry[1].bar)  # output: 987
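One possible variation, sketched under the assumption that every record shares the same layout: define the widths once at module level and precompute slice objects, so the table is not rebuilt on every __init__ call. FIELDS, SLICES and RECORD_LENGTH are hypothetical names, and the widths are the illustrative ones from above.

```python
from itertools import accumulate

# (name, width) pairs shared by every record; illustrative values
FIELDS = (('foo', 2), ('bar', 3), ('baz', 2))
ENDS = tuple(accumulate(width for _, width in FIELDS))
# Map each field name to the slice that extracts it from a line.
SLICES = {name: slice(start, end)
          for (name, _), start, end in zip(FIELDS, (0,) + ENDS, ENDS)}
RECORD_LENGTH = ENDS[-1]

class Entry:
    def __init__(self, line):
        # One length check up front replaces the per-field check.
        if len(line) < RECORD_LENGTH:
            raise ValueError("not enough characters: {!r}".format(line))
        for name, field_slice in SLICES.items():
            setattr(self, name, line[field_slice])

entries = [Entry(line) for line in "ab789yz\ncd987wx\nef555uv".split('\n')]
print(entries[1].bar)  # 987
```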
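Answer 8 (score: 0)
Because my old job often involved fixed-width data with 1 million lines, I looked into this problem when I started using Python.
There are 2 types of FixedWidth: ASCII FixedWidth, where widths are counted in bytes, and Unicode FixedWidth, where widths are counted in characters.
If the source string consists entirely of ASCII characters, then ASCII FixedWidth = Unicode FixedWidth.
Fortunately, strings and bytes are distinct in Python 3, which removes a lot of confusion when dealing with double-byte encoded characters (e.g. gbk, big5, euc-jp, shift-jis, etc.). To process "ASCII FixedWidth", the string is usually converted to bytes and then split.
Without importing third-party modules, with totalLineCount = 1 million, lineLength = 800 bytes, FixedWidthArgs = (10, 25, 4, ....), I split the line in about 5 different ways and came to the following conclusion:
slice(bytes) is faster than slice(string).
When dealing with large files, we often use with open(file, "rb") as f:.
Simply iterating over one of the files above that way takes about 2.4 seconds.
I think a suitable handler should process 1 million rows of data, splitting each row into 20 fields, in less than 2.4 seconds.
I only found that struct and itemgetter meet the requirement.
PS: For normal display, I converted the unicode str to bytes. If you are in a double-byte environment, you don't need to do this.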
from itertools import accumulate
from operator import itemgetter
def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    # Negative parameter field index
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    # Get slice args and Ignore fields of negative length
    ig_Args = tuple(item for i, item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    # Generate `operator.itemgetter` object
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj
lineb = b'abcdefghijklmnopqrstuvwxyz\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4\xb6\xee\xb7\xa2\xb8\xf6\xba\xcd0123456789'
line = lineb.decode("GBK")
# Unicode Fixed Width
fieldwidthsU = (13, -13, 4, -4, 5,-5) # Negative width fields is ignored
# ASCII Fixed Width
fieldwidths = (13, -13, 8, -8, 5,-5) # Negative width fields is ignored
# Unicode FixedWidth processing
parse = oprt_parser(fieldwidthsU)
fields = parse(line)
print('Unicode FixedWidth','fields: {}'.format(tuple(map(lambda s: s.encode("GBK"), fields))))
# ASCII FixedWidth processing
parse = oprt_parser(fieldwidths)
fields = parse(lineb)
print('ASCII FixedWidth','fields: {}'.format(fields))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)
parse = oprt_parser(fieldwidths)
fields = parse(line)
print(f"fields: {fields}")
Output:
Unicode FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
ASCII FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
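A sketch of how this could be applied to a file opened in binary mode, assuming GBK-encoded records whose widths are counted in bytes. make_byte_parser is a hypothetical name, the sample records are made up, and io.BytesIO stands in for open(path, "rb").

```python
import io
from itertools import accumulate
from operator import itemgetter

def make_byte_parser(widths):
    """Build an itemgetter of slices; negative widths are skipped.

    Note: itemgetter returns a tuple only when given two or more slices.
    """
    ends = tuple(accumulate(abs(w) for w in widths))
    spans = [slice(s, e) for w, s, e in zip(widths, (0,) + ends, ends) if w > 0]
    return itemgetter(*spans)

# Widths are in bytes: each GBK-encoded Chinese character takes 2 bytes.
parse = make_byte_parser((6, -2, 4, 3))
records = "name01  中文123\nname02  文中456\n".encode("GBK")

rows = []
with io.BytesIO(records) as f:          # stands in for open(path, "rb")
    for raw_line in f:
        # Split on byte offsets first, then decode each field.
        rows.append(tuple(field.decode("GBK") for field in parse(raw_line)))

print(rows)
```

Splitting before decoding matters here: slicing the decoded string with byte widths would cut the fields in the wrong places.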
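oprt_parser is 4x as fast as make_parser (list comprehension + slicing).
During this research I found that the efficiency of the re method seems to improve faster as CPU speed increases.
Since I don't have more or better computers to test on, I'm providing my test code; anyone interested can run it on a faster machine.
Runtime environment: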
import timeit
import time
import re
from itertools import accumulate
from operator import itemgetter
def eff2(stmt,onlyNum= False,showResult=False):
    '''test function'''
    if onlyNum:
        rl = timeit.repeat(stmt=stmt,repeat=roundI,number=timesI,globals=globals())
        avg = sum(rl) / len(rl)
        return f"{avg * (10 ** 6)/timesI:0.4f}"
    else:
        rl = timeit.repeat(stmt=stmt,repeat=10,number=1000,globals=globals())
        avg = sum(rl) / len(rl)
        print(f"【{stmt}】")
        print(f"\tquick avg = {avg * (10 ** 6)/1000:0.4f} s/million")
        if showResult:
            print(f"\t Result = {eval(stmt)}\n\t timelist = {rl}\n")
        else:
            print("")
def upDouble(argList,argRate):
    return [c*argRate for c in argList]
tbStr = "000000001111000002222真2233333333000000004444444QAZ55555555000000006666666ABC这些事中文字abcdefghijk"
tbBytes = tbStr.encode("GBK")
a20 = (4,4,2,2,2,3,2,2, 2 ,2,8,8,7,3,8,8,7,3, 12 ,11)
a20U = (4,4,2,2,2,3,2,2, 1 ,2,8,8,7,3,8,8,7,3, 6 ,11)
Slng = 800
rateS = Slng // 100
tStr = "".join(upDouble(tbStr , rateS))
tBytes = tStr.encode("GBK")
spltArgs = upDouble( a20 , rateS)
spltArgsU = upDouble( a20U , rateS)
testList = []
timesI = 100000
roundI = 5
print(f"test round = {roundI} timesI = {timesI} sourceLng = {len(tStr)} argFieldCount = {len(spltArgs)}")
print(f"pure str \n{''.ljust(60,'-')}")
# ==========================================
def str_parser(sArgs):
    def prsr(oStr):
        r = []
        r_ap = r.append
        stt=0
        for lng in sArgs:
            end = stt + lng
            r_ap(oStr[stt:end])
            stt = end
        return tuple(r)
    return prsr
Str_P = str_parser(spltArgsU)
# eff2("Str_P(tStr)")
testList.append("Str_P(tStr)")
print(f"pure bytes \n{''.ljust(60,'-')}")
# ==========================================
def byte_parser(sArgs):
    def prsr(oBytes):
        r, stt = [], 0
        r_ap = r.append
        for lng in sArgs:
            end = stt + lng
            r_ap(oBytes[stt:end])
            stt = end
        return r
    return prsr
Byte_P = byte_parser(spltArgs)
# eff2("Byte_P(tBytes)")
testList.append("Byte_P(tBytes)")
# re,bytes
print(f"re compile object \n{''.ljust(60,'-')}")
# ==========================================
def rebc_parser(sArgs,otype="b"):
    re_Args = "".join([f"(.{{{n}}})" for n in sArgs])
    if otype == "b":
        rebc_Args = re.compile(re_Args.encode("GBK"))
    else:
        rebc_Args = re.compile(re_Args)
    def prsr(oBS):
        return rebc_Args.match(oBS).groups()
    return prsr
Rebc_P = rebc_parser(spltArgs)
# eff2("Rebc_P(tBytes)")
testList.append("Rebc_P(tBytes)")
Rebc_Ps = rebc_parser(spltArgsU,"s")
# eff2("Rebc_Ps(tStr)")
testList.append("Rebc_Ps(tStr)")
print(f"struct \n{''.ljust(60,'-')}")
# ==========================================
import struct
def struct_parser(sArgs):
    struct_Args = " ".join(map(lambda x: str(x) + "s", sArgs))
    def prsr(oBytes):
        return struct.unpack(struct_Args, oBytes)
    return prsr
Struct_P = struct_parser(spltArgs)
# eff2("Struct_P(tBytes)")
testList.append("Struct_P(tBytes)")
print(f"List Comprehensions + slice \n{''.ljust(60,'-')}")
# ==========================================
import itertools
def slice_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    slice_Args = tuple(zip((0,)+tl,tl))
    def prsr(oBytes):
        return [oBytes[s:e] for s, e in slice_Args]
    return prsr
Slice_P = slice_parser(spltArgs)
# eff2("Slice_P(tBytes)")
testList.append("Slice_P(tBytes)")
def sliceObj_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    tl2 = tuple(zip((0,)+tl,tl))
    sliceObj_Args = tuple(slice(s,e) for s,e in tl2)
    def prsr(oBytes):
        return [oBytes[so] for so in sliceObj_Args]
    return prsr
SliceObj_P = sliceObj_parser(spltArgs)
# eff2("SliceObj_P(tBytes)")
testList.append("SliceObj_P(tBytes)")
SliceObj_Ps = sliceObj_parser(spltArgsU)
# eff2("SliceObj_Ps(tStr)")
testList.append("SliceObj_Ps(tStr)")
print(f"operator.itemgetter + slice object \n{''.ljust(60,'-')}")
# ==========================================
def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    ig_Args = tuple(item for i,item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj
Oprt_P = oprt_parser(spltArgs)
# eff2("Oprt_P(tBytes)")
testList.append("Oprt_P(tBytes)")
Oprt_Ps = oprt_parser(spltArgsU)
# eff2("Oprt_Ps(tStr)")
testList.append("Oprt_Ps(tStr)")
print("|".join([s.split("(")[0].center(11," ") for s in testList]))
print("|".join(["".center(11,"-") for s in testList]))
print("|".join([eff2(s,True).rjust(11," ") for s in testList]))
Output:
Test round = 5 timesI = 100000 sourceLng = 744 argFieldCount = 20
...
...
   Str_P   |   Byte_P  |   Rebc_P  |  Rebc_Ps  |  Struct_P |  Slice_P  | SliceObj_P|SliceObj_Ps|   Oprt_P  |  Oprt_Ps
-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------
     9.6315|     7.5952|     4.4187|     5.6867|     1.5123|     5.2915|     4.2673|     5.7121|     2.4713|     3.9051
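Answer 9 (score: 0)
Here's how I solved it with a dictionary that holds where each field starts and ends. Giving start and end points also helps me manage changes in column lengths.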
# fixed length
#      '---------- ------- ----------- -----------'
line = '20.06.2019 myname  active      mydevice   '
SLICES = {'date_start': 0,
          'date_end': 10,
          'name_start': 11,
          'name_end': 18,
          'status_start': 19,
          'status_end': 30,
          'device_start': 31,
          'device_end': 42}
def get_values_as_dict(line, SLICES):
    values = {}
    key_list = {key.split("_")[0] for key in SLICES.keys()}
    for key in key_list:
        values[key] = line[SLICES[key+"_start"]:SLICES[key+"_end"]].strip()
    return values
>>> print (get_values_as_dict(line,SLICES))
{'status': 'active', 'name': 'myname', 'date': '20.06.2019', 'device': 'mydevice'}
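A possible simplification, sketched here as an assumption rather than the author's method: store one slice object per field instead of separate _start and _end keys, which removes the key-name parsing. FIELD_SLICES and get_values are hypothetical names; the offsets are the same as above.

```python
line = '20.06.2019 myname  active      mydevice   '

# One slice per field instead of separate *_start / *_end keys.
FIELD_SLICES = {
    'date': slice(0, 10),
    'name': slice(11, 18),
    'status': slice(19, 30),
    'device': slice(31, 42),
}

def get_values(line, field_slices):
    """Cut each named field out of the line and strip its padding."""
    return {name: line[span].strip() for name, span in field_slices.items()}

values = get_values(line, FIELD_SLICES)
print(values)
```

Changing a column's length still means editing a single entry, and field names containing underscores no longer cause trouble.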
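Answer 10 (score: 0)
I like to process text files containing fixed-width fields using regular expressions, more specifically using named capture groups. It's fast, doesn't require importing a large library, and is quite descriptive and convenient (in my opinion).
I also like that named capture groups basically auto-document the data format, acting as a kind of data specification, since each capture group can be written to define each field's name, data type, and length.
Here's a simple example...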
import re
data = [
    "1234ABCDEFGHIJ5",
    "6789KLMNOPQRST0"
]
record_regex = (
    r"^"
    r"(?P<firstnumbers>[0-9]{4})"
    r"(?P<middletext>[a-zA-Z0-9_\-\s]{10})"
    r"(?P<lastnumber>[0-9]{1})"
    r"$"
)
records = []
for line in data:
    match = re.match(record_regex, line)
    if match:
        records.append(match.groupdict())
print(records)
...which produces a convenient dictionary for each record:
[
{'firstnumbers': '1234', 'lastnumber': '5', 'middletext': 'ABCDEFGHIJ'},
{'firstnumbers': '6789', 'lastnumber': '0', 'middletext': 'KLMNOPQRST'}
]
If you're not familiar with Python regular expressions or named capture groups, a useful tool such as an online regex tester and debugger can help.
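When there are many fields, the pattern could also be generated from a small table rather than written by hand. The following is only a sketch: FIELDS and parse are hypothetical names, and the layout mirrors the example above.

```python
import re

# (name, character class, width): a hypothetical spec mirroring the example above
FIELDS = [
    ("firstnumbers", r"[0-9]", 4),
    ("middletext", r"[a-zA-Z0-9_\-\s]", 10),
    ("lastnumber", r"[0-9]", 1),
]

# Each field becomes one named group, e.g. (?P<firstnumbers>[0-9]{4})
pattern = "".join("(?P<{}>{}{{{}}})".format(name, chars, width)
                  for name, chars, width in FIELDS)
record_regex = re.compile(pattern)

def parse(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    match = record_regex.fullmatch(line)
    return match.groupdict() if match else None

print(parse("1234ABCDEFGHIJ5"))
print(parse("too short"))
```

Compiling the pattern once and using fullmatch (Python 3.4+) replaces the explicit ^ and $ anchors and avoids re-parsing the pattern on every line.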