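Suppose I have several records in a text file.
They are separated from one another by blank lines (i.e. \n\n).
All records follow the same format. Within each record, fields are separated by a comma or \n (so a field must fit on a single line). Within each field of a record, the number of characters is limited to a user-specified length, and the field ends with a comma or \n.
E.g., the user specifies that each record has four fields named A, B, C, D, with length limits of 4, 3, 10, and 5 characters respectively. The text file contains two records: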
aaaa, bb
ccc
ddddd

 ee ,fff
gggg,ggg
hhhh
How can we write a program that reads the text file into a list of dicts, with each dict representing one record:
>>> records[0]
{'A':'aaaa', 'B':' bb', 'C':'ccc', 'D':'ddddd'}
>>> records[1]
{'A':' ee ', 'B':'fff', 'C':'gggg,ggg', 'D':'hhhh'}
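Note: leading or trailing whitespace in field values is not significant.
Thanks.
A harder problem:
For example, let us change the example above by allowing the third field to span multiple lines. Change the second record to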
 ee ,ff
ggg
g,ggg
hhhhh
The third field C is then:
ggg
g,ggg
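How can we implement this?
Answer 0: (score: 1)
Take a look at the MULTILINE flag of the re module.
What you describe is a fairly well-defined record, so a regular expression like this one should be able to parse it: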
re.compile(r"^.{1,4}[,]?.{1,3}[,]?.{1,10}[,]?.{1,5}[,]?$", re.MULTILINE)
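As written, `.` does not match a newline, so the pattern above matches a record only when all four fields sit on one line. A minimal sketch of one way to adapt it follows, assuming each separator may be either a comma or a newline. The named groups and the names `RECORD_RE` and `parse_records` are illustrative additions, and the pattern has only been checked against the sample data from the question.

```python
import re

# Assumed adaptation of the pattern above: every separator may be a comma
# or a newline, and each field is captured in a named group (A, B, C, D).
RECORD_RE = re.compile(
    r"^(?P<A>.{1,4})[,\n]?(?P<B>.{1,3})[,\n]?(?P<C>.{1,10})[,\n]?(?P<D>.{1,5})$",
    re.MULTILINE,
)

def parse_records(text):
    """Return one dict per record matched in `text` (values are not stripped)."""
    return [m.groupdict() for m in RECORD_RE.finditer(text)]

# The two sample records from the question, separated by a blank line.
sample = "aaaa, bb\nccc\nddddd\n\n ee ,fff\ngggg,ggg\nhhhh\n"
records = parse_records(sample)
```

Because every quantifier is greedy, a comma that falls inside a field's length limit (as in `gggg,ggg`) is kept as part of the field's value.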
Answer 1: (score: 0)
def get_records(fname, field_lengths):
    """Return list of dict, where
    each dict represents a record,
    as requested in the problem.
    fname : str
        the filename
    field_lengths : list of int
        the specified field lengths
    """
    result = list()
    #could use a regex, but let's brute force it:
    #extract records one at a time from the file
    #(record extraction is delegated to `get_record`)
    with open(fname, "r") as fin: #the file is closed on exit
        keepreading = True
        while keepreading:
            try:
                result.append( get_record(fin, field_lengths) )
                _ = next(fin) #discard record separator
            except StopIteration:
                keepreading = False
    return result
def get_record(fin, field_lengths):
    """Return dict, representing one record.
    fin : filehandle
        the file for record extraction
    field_lengths : list of int
        the specified field lengths
    """
    n = len(field_lengths)
    record_strings = [] #list to hold the fields of one record
    line = next(fin).rstrip("\n")
    for i, l in enumerate(field_lengths):
        if len(line) <= l: #only one field left on this line
            record_strings.append(line)
            if (i < n-1):
                line = next(fin).rstrip("\n")
        else: #more than one field on this line
            record_strings.append(line[:l])
            line = line[l:].lstrip(",")
    #now we have a record as a list of strings,
    #but the problem asks for a dict, convert it.
    result = process_record_strings(record_strings)
    return result
def process_record_strings(ss):
    """Return dict, mapping field names to values.
    Input `ss` is a list of strings representing a record.
    White space is stripped from these strings, as in the
    problem example.
    """
    A, B, C, D = map(lambda x: x.strip(), ss)
    return dict(zip("ABCD",(A,B,C,D)))
#Example use:
field_lengths = 4,3,10,5
print(get_records("temp.txt", field_lengths=field_lengths))
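The functions above hard-code the four field names. As a hedged sketch of the same brute-force idea, the variant below takes the field names as a parameter and works on a string instead of a file handle. The names `parse_record` and `parse_text` are hypothetical, and the sketch assumes records are separated by exactly one blank line and that at most one comma follows a field that reaches its length limit.

```python
def parse_record(block, names, lengths):
    """Split the text of one record into a dict of stripped field values."""
    record = {}
    rest = block
    for name, limit in zip(names, lengths):
        line, newline, tail = rest.partition("\n")
        if len(line) <= limit:
            # The field runs to the end of the line.
            value, rest = line, tail
        else:
            # The field stops at its length limit; drop one separating comma.
            value = line[:limit]
            remainder = line[limit:]
            if remainder.startswith(","):
                remainder = remainder[1:]
            rest = remainder + newline + tail
        record[name] = value.strip()
    return record

def parse_text(text, names, lengths):
    """Return a list of dicts, one per blank-line-separated record."""
    blocks = [b for b in text.split("\n\n") if b.strip()]
    return [parse_record(b, names, lengths) for b in blocks]

# The two sample records from the question.
sample = "aaaa, bb\nccc\nddddd\n\n ee ,fff\ngggg,ggg\nhhhh\n"
parsed = parse_text(sample, names="ABCD", lengths=(4, 3, 10, 5))
```

Splitting on `"\n\n"` first keeps each record's parsing independent, so a malformed record cannot shift the field boundaries of the records that follow it.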