Question

The file has the following format:

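Component_name - version - author@email.com - multi-line comment containing newlines and other whitespace
     \t ...continuation of the multi-line comment
  Component_name2 - version - author2@email.com - multi-line comment that may contain newlines and other whitespace
  Component_name - version - author@email.com - multi-line comment 2 that may contain newlines and other whitespace
  Component_name - version - author2@email.com - multi-line comment 2 that may contain newlines and other whitespace
  etc...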

After parsing, the output should be grouped by component_name:

output = [
     "component_name" -> ["version - author@email.com - comment 1", "version - author@email.com - comment 2", ...],
     "component_name2" -> [...],
     ...
]

This is what I have so far for parsing it:

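import re
from itertools import groupby

reTemp = r"[\w\_\-]*( \- )(\d*\.?){3}( \- )[\w\d\_\-\.\@]*( \- )[\S ]*"
numData = 4  # fields per entry: name, version, author, comment
reFormat = re.compile(reTemp)

textFileLines = textFile.split("\n")
# keep only lines that look like the start of an entry, split into at most numData fields
temp = [x.split(" - ", numData - 1) for x in textFileLines if re.search(reFormat, x)]
m = filter(None, temp) # remove all empty lists
# note: groupby only merges consecutive items that share a key
group = groupby(m, lambda y: y[0].strip())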

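This works for single-line comments but fails on multi-line comments. I'm also not sure that regex is the right tool here. Is there a better, more Pythonic way to do this?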

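Edit:

Multi-line comments continue on new lines that begin with a tab, \t (see the first entry above for an example)
Comments are Git commit messages and can contain JSON or code
Entries are separated by newlines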

Answer 1

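I have had to deal with structured data files like this, and ended up writing a state machine to parse the file. Something like this (rough pseudocode):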

record = None
for line in file:
    if line matches new_record_regex:
        if record is not None:
            records.append(record)
        record = {"version": field1, "author": field2, "comment": field3}
    else:
        record["comment"] += line
if record is not None:
    records.append(record)
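
A runnable version of the state-machine idea above might look like the sketch below. It is only an illustration under some assumptions that the question does not fully pin down: the header pattern NEW_RECORD is a guess based on the sample, continuation lines are assumed to start with a tab, blank lines are skipped, and the names parse_records and group_by_component are made up for this example.

```python
import re
from collections import defaultdict

# Assumed header shape: "name - version - author@host - first comment line".
# The exact pattern is a guess based on the sample in the question.
NEW_RECORD = re.compile(
    r"^ *(?P<name>[\w-]+) - (?P<version>\d+(?:\.\d+)*)"
    r" - (?P<author>\S+@\S+) - (?P<comment>.*)$"
)


def parse_records(text):
    """Return one dict per entry, folding continuation lines into the comment."""
    records = []
    record = None  # state: the entry currently being built, if any
    for line in text.splitlines():
        # Tab-indented lines are never headers (assumption from the question's edit).
        match = None if line.startswith("\t") else NEW_RECORD.match(line)
        if match:
            if record is not None:
                records.append(record)
            record = match.groupdict()
        elif record is not None and line.strip():
            record["comment"] += "\n" + line.strip()
    if record is not None:
        records.append(record)  # do not lose the last entry
    return records


def group_by_component(records):
    """Group "version - author - comment" strings by component name."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["name"]].append(
            f'{r["version"]} - {r["author"]} - {r["comment"]}'
        )
    return dict(grouped)
```

Calling group_by_component(parse_records(textFile)) should give a dict shaped like the output asked for in the question. A dict is used instead of itertools.groupby because groupby only merges adjacent items, and entries for the same component are not adjacent in the file.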

Answer 2

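You may want to formalize the file format as a grammar, and then use one of the many parsers / parser generators available for Python to interpret the file according to that grammar.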
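
To make the grammar idea a bit more concrete, here is a small sketch. A third-party library such as pyparsing or Lark would consume a grammar more directly; to stay within the standard library, this example writes the grammar down informally in a comment and uses one verbose regular expression as a stand-in for a generated parser. The grammar itself, the pattern ENTRY, and the function parse_entries are assumptions inferred from the sample in the question, not a definitive description of the format.

```python
import re

# Informal grammar (an assumption inferred from the question's sample):
#   file         := entry*
#   entry        := header continuation*
#   header       := name " - " version " - " author " - " text NEWLINE
#   continuation := TAB text NEWLINE
ENTRY = re.compile(
    r"""
    ^[ ]*(?P<name>[\w-]+)        # component name at the start of a line
    [ ]-[ ](?P<version>[\d.]+)   # version, digits and dots
    [ ]-[ ](?P<author>\S+@\S+)   # author e-mail
    [ ]-[ ](?P<comment>          # comment text:
        [^\n]*                   #   rest of the header line
        (?:\n\t[^\n]*)*          #   zero or more tab-indented lines
    )
    """,
    re.MULTILINE | re.VERBOSE,
)


def parse_entries(text):
    """Return one dict per entry matched by ENTRY, in file order."""
    entries = []
    for match in ENTRY.finditer(text):
        entry = match.groupdict()
        # Drop the tab that marks each continuation line.
        entry["comment"] = entry["comment"].replace("\n\t", "\n")
        entries.append(entry)
    return entries
```

Each returned dict has the keys name, version, author, and comment, so grouping by name afterwards is a simple loop over the list.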

Parsing structured data from a file in Python
