Question

Suppose I have several records in a text file. They are separated from one another by a blank line (i.e. \n\n).

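All records follow the same format. In each record:

there is a fixed number of fields,
fields are separated by a comma or a newline \n (so a field must fit on a single line).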

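In each field of a record:

a newline character is never part of a field, whereas a comma can be part of a field.
each field has a limit on its length (in characters), which is supplied by user input.
if a field has fewer characters than its limit, it must be terminated by a newline \n.
if a field is followed by another field on the same line, it must have reached its length limit and is separated from the following field by a comma.
a field can be empty, i.e. an empty line (which is also the record separator, but we know the number of fields in each record, so we can tell the two cases apart).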

E.g.

The user specifies that each record has four fields named A, B, C, D, with length limits of 4, 3, 10 and 5. The text file contains two records:

aaaa, bb
ccc
ddddd

 ee ,fff
gggg,ggg
hhhh

How can we write a program that reads the text file into a list of dicts, each dict representing one record:

>>> records[0] 
{'A':'aaaa', 'B':' bb', 'C':'ccc', 'D':'ddddd'}
>>> records[1] 
{'A':' ee ', 'B':'fff', 'C':'gggg,ggg', 'D':'hhhh'}

Note: leading or trailing whitespace in field values is not significant.

Thanks.

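A harder version of the problem:

We allow one specific field to span one or more lines, and we know which field it is.
We also know that the fields following it always reach their length limit.
No other field may span multiple lines.

For example, let's change the example above by allowing the third field to span multiple lines. Change the second record to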

 ee ,ff
ggg
g,ggg
hhhhh

The third field, C, is:

ggg
g,ggg

How can we implement this?

Answer 1

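Have a look at the re module's MULTILINE flag.
What you describe is a fairly well-defined record, so a regular expression such as this one should be able to parse it: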

re.compile(r"^.{1,4}[,]?.{1,3}[,]?.{1,10}[,]?.{1,5}[,]?$", re.MULTILINE)
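
One thing to keep in mind: without re.DOTALL, `.` does not match a newline, so a pattern meant to cover a whole record has to mention \n explicitly. Below is a minimal sketch of one way to build such a pattern from the length limits. The helper names `build_record_pattern` and `parse_records` are made up for illustration, and the sketch assumes the whole file is already read into a string and that no field spans several lines:

```python
import re

def build_record_pattern(limits):
    """Compile a regex matching one record; `limits` are the field length limits."""
    parts = []
    for limit in limits[:-1]:
        # Either a full-length field followed by a comma,
        # or a field of at most `limit` characters ended by a newline.
        parts.append(r"(.{%d}(?=,)|.{0,%d}(?=\n))[,\n]" % (limit, limit))
    # The last field runs to the end of its line (or of the text).
    parts.append(r"(.{0,%d})(?=\n|\Z)" % limits[-1])
    return re.compile("".join(parts))

def parse_records(text, names, limits):
    """Return a list of dicts mapping field names to raw field values."""
    pattern = build_record_pattern(limits)
    records, pos = [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            raise ValueError("malformed record at offset %d" % pos)
        records.append(dict(zip(names, m.groups())))
        pos = m.end()
        # Skip the newline ending the last field plus the blank separator
        # line, but no more: a third newline would be an empty first field.
        for _ in range(2):
            if pos < len(text) and text[pos] == "\n":
                pos += 1
    return records

sample = "aaaa, bb\nccc\nddddd\n\n ee ,fff\ngggg,ggg\nhhhh\n"
for record in parse_records(sample, "ABCD", [4, 3, 10, 5]):
    print(record)
```

The values are returned unstripped; call `.strip()` on them if the surrounding whitespace should go.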

Answer 2

def get_records(fname, field_lengths):
  """Return list of dict, where
  each dict represents a record,
  as requested in the problem.
  fname : str
    the filename
  field_lengths : list of int
    the specified field lengths
  """
  result = list()
  #could use a regex, but let's brute force it:
  #extract records one at a time from the file
  #(record extraction is delegated to `get_record`)
  with open(fname, "r") as fin:  #the file is closed when we are done
    keepreading = True
    while keepreading:
      try:
        result.append( get_record(fin, field_lengths) )
        _ = next(fin) #discard record separator
      except StopIteration:
        keepreading = False
  return result

def get_record(fin, field_lengths):
  """Return dict, representing one record.
  fin : filehandle
    the file for record extraction
  field_lengths : list of int
    the specified field lengths
  """
  n = len(field_lengths)
  record_strings = []  #list to hold the fields of one record
  line = next(fin).rstrip("\n")
  for i, l in enumerate(field_lengths):
    if len(line) <= l: #only one field (left) on this line
      record_strings.append(line)
      if (i < n-1):
        line = next(fin).rstrip("\n")
    else: #more fields follow on this line
      record_strings.append(line[:l])
      #drop exactly one separating comma; lstrip(",") would also
      #remove commas that belong to the next field
      line = line[l+1:] if line[l] == "," else line[l:]
  #now we have a record as a list of strings,
  #but the problem asks for a dict, convert it.
  result = process_record_strings(record_strings)
  return result

def process_record_strings(ss):
  """Return dict, mapping field names to values.
  Input `ss` is a list of strings representing a record.
  White space is stripped from these strings, as in the
  problem example.
  """
  A, B, C, D = map(lambda x: x.strip(), ss)
  return dict(zip("ABCD",(A,B,C,D)))

#Example use:
field_lengths = 4,3,10,5
print(get_records("temp.txt", field_lengths=field_lengths))
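
For the harder variant in the question (one known field may span several lines, and every field after it always reaches its length limit), one possible approach is to parse the leading fields from the front, peel the fixed-length trailing fields off the back, and treat whatever remains as the multi-line field. This is only a sketch under assumptions: `split_record` is a hypothetical helper, the record has already been isolated as a single string without its trailing newline, all limits are positive, and the limit of the multi-line field itself is not checked.

```python
def split_record(record, limits, multiline_index):
    """Split one record string into a list of field strings.

    `multiline_index` is the position of the field that may span lines.
    """
    fields = []
    rest = record
    # Fields before the multi-line one follow the usual rule.
    for limit in limits[:multiline_index]:
        line_end = rest.find("\n")
        if line_end == -1:
            line_end = len(rest)
        if line_end <= limit:
            # The field is the whole line and is ended by the newline.
            fields.append(rest[:line_end])
            rest = rest[line_end + 1:]
        else:
            # A full-length field must be followed by a comma.
            if rest[limit] != ",":
                raise ValueError("expected ',' after field %r" % rest[:limit])
            fields.append(rest[:limit])
            rest = rest[limit + 1:]
    # Fields after the multi-line one always have full length,
    # so they can be taken from the end of the record.
    tail = []
    for limit in reversed(limits[multiline_index + 1:]):
        if len(rest) < limit + 1 or rest[-limit - 1] not in ",\n":
            raise ValueError("trailing field of length %d not found" % limit)
        tail.append(rest[-limit:])
        rest = rest[:-limit - 1]
    # Whatever is left in the middle is the multi-line field.
    return fields + [rest] + tail[::-1]

print(split_record(" ee ,ff\nggg\ng,ggg\nhhhhh", [4, 3, 10, 5], 2))
```

Splitting the file into records beforehand is the part left open here: a plain split on blank lines only works if no field is empty.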

Reading records with field length limits
