Question

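I'm trying to find the best way to parse a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks something like this: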

UI: T020  
STY: Acquired Abnormality  
ABR: acab   
STN: A1.2.2.2  
DEF: An abnormal structure, or one that is abnormal in size or location, found   
in or deriving from a previously normal structure.  Acquired abnormalities are  
distinguished from diseases even though they may result in pathological   
functioning (e.g., "hernias incarcerate").   
HL: {isa} Anatomical Abnormality

UI: T145   
RL: exhibits   
ABR: EX   
RIN: exhibited_by   
RTN: R3.3.2   
DEF: Shows or demonstrates.   
HL: {isa} performs   
STL: [Animal|Behavior]; [Group|Behavior]   

UI: etc...

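While a few attributes are shared (e.g. UI), some are not (e.g. STY). However, I could hardcode an exhaustive list of the necessary attributes. Since each grouping is separated by an empty line, I used split so that I can process each chunk of data individually: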

chunks = file.read().split("\n\n")  # renamed so the built-in input() is not shadowed
for chunk in chunks:
    process(chunk)
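
The sample lines carry trailing spaces, so a "blank" separator line may not be strictly empty, and a plain split on "\n\n" could then miss a boundary. A minimal sketch of a more tolerant split, assuming any whitespace-only line separates entities (split_chunks is a hypothetical helper name, not something from the post):

```python
import re

def split_chunks(text):
    """Split text into entity chunks separated by blank or whitespace-only lines."""
    # "\n\s*\n" matches a newline, any run of whitespace (including more newlines), and a newline
    chunks = re.split(r"\n\s*\n", text)
    # Drop empty pieces and trim leading/trailing whitespace from each chunk
    return [chunk.strip() for chunk in chunks if chunk.strip()]

sample = "UI: T020  \nSTY: A\n   \nUI: T145   \nRL: exhibits\n\n\nUI: T999\n"
for chunk in split_chunks(sample):
    print(repr(chunk))
```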

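I've seen some approaches that use string find/slice, itertools.groupby, and even regexes. I was thinking of using a regex of '[A-Z]*:' to find where the headers are, but I'm not sure how to pull out the multiple lines that follow until another header is reached (such as the multi-line data following DEF in the first example entity).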
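
One way to pull out multi-line values with a regex alone is a lazy match that stops at a lookahead for the next header. This is only a sketch, assuming headers are two or three capital letters at the start of a line and that no continuation line happens to begin the same way. FIELD_RE and parse_fields are hypothetical names.

```python
import re

# A header at line start, then everything (lazily) up to the next header or the end of the chunk
FIELD_RE = re.compile(
    r"^([A-Z]{2,3}):[ \t]*(.*?)(?=^[A-Z]{2,3}:|\Z)",
    re.MULTILINE | re.DOTALL,
)

def parse_fields(chunk):
    """Return a dict of header -> value, with runs of whitespace collapsed to one space."""
    return {key: " ".join(value.split()) for key, value in FIELD_RE.findall(chunk)}

chunk = (
    "UI: T020  \n"
    "DEF: An abnormal structure, found   \n"
    "in a normal structure.  \n"
    "HL: {isa} Anatomical Abnormality\n"
)
print(parse_fields(chunk))
```

Note that collapsing whitespace also turns the double space after a full stop into a single space; keep the raw value if that matters.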

I'd appreciate any suggestions.

Answer 1

source = """
UI: T020  
STY: Acquired Abnormality  
ABR: acab   
STN: A1.2.2.2  
DEF: An abnormal structure, or one that is abnormal in size or location, found   
in or deriving from a previously normal structure.  Acquired abnormalities are  
distinguished from diseases even though they may result in pathological   
functioning (e.g., "hernias incarcerate").   
HL: {isa} Anatomical Abnormality
"""

import re

inpt = source.split("\n")  # just emulating a file

reg = re.compile(r"^([A-Z]{2,3}):(.*)$")
output = dict()
current_key = None
current = ""
for line in inpt:
    line_match = reg.match(line)  # check if we hit a "CODE: content" line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current  # if so, store the finished value under the previous key
        current_key = line_match.group(1)
        current = line_match.group(2).strip()
    elif current_key is not None and line.strip():
        # if it's not, it should be the continuation of the previous key's line
        current = current + " " + line.strip()

if current_key is not None:
    output[current_key] = current  # don't forget the last one
print(output)

Answer 2

I'm assuming that if a string spans multiple lines, you want the newlines replaced with spaces (and any other extra whitespace removed).

import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s') # Matches line header
    tmp = '' # Stored/cached data for multiline string
    key = None # Current key
    data = {}

    with open(filename,'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)

            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''

            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}

                continue

            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue

            # No header, just append string -> here goes assumption that you want to
            # remove newlines, trailing spaces and replace them with one single space
            tmp += ' ' + row

    # Missed row?
    if key:
        data[key] = tmp

    # Missed group?
    if data:
        yield data

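This generator yields a dict of pairs such as UI: T020 on each iteration (and always with at least one item).

Since it uses a generator and reads the file line by line, it should be efficient even on large files, and it won't read the whole file into memory at once.

Here's a little demo: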

for data in process_file('data.txt'):
    print('-'*20)
    for i in data:
        print('%s:'%(i), data[i])

    print()

Actual output:

--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure.  Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab

--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
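
The same lazy, one-group-at-a-time reading can also be sketched with itertools.groupby, which the question mentions. This is a minimal illustration rather than a replacement for the generator above. iter_chunks is a hypothetical name, and io.StringIO only stands in for an open file.

```python
import io
from itertools import groupby

def iter_chunks(lines):
    """Yield one list of stripped lines per entity, reading the input lazily."""
    # groupby starts a new group whenever the key (blank vs. non-blank line) changes
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            yield [line.rstrip() for line in group]

# Emulate a file object; any iterable of lines works the same way
fake_file = io.StringIO("UI: T020  \nSTY: Acquired Abnormality  \n   \n\nUI: T145   \nRL: exhibits   \n")
for chunk in iter_chunks(fake_file):
    print(chunk)
```

Each yielded list can then be handed to whichever per-entity parser you prefer.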

Answer 3

import re
from collections import namedtuple

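def process(chunk):
    # With a capturing group, re.split returns [prefix, key1, value1, key2, value2, ...]
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    # Keys sit at the odd indexes; each value follows its key
    for i in range(1, len(split_chunk) - 1, 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i + 1].strip()
    # The tuple type is named after the first header found (e.g. 'UI')
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)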

That should do it. I think I'd just use the dict, though - why are you so attached to namedtuple?
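
If a namedtuple is still wanted, one option is to hardcode the exhaustive field list the question mentions, so that every entity shares a single type and missing attributes come back as None. A minimal sketch, assuming Python 3.7+ (for the defaults argument) and assuming the field list below, which is only what appears in the sample data. Entity and to_entity are hypothetical names.

```python
from collections import namedtuple

# Assumed exhaustive list of headers, taken from the two sample entities only
FIELDS = ["UI", "STY", "RL", "ABR", "STN", "RIN", "RTN", "DEF", "HL", "STL"]

# Every field is optional: anything not supplied defaults to None
Entity = namedtuple("Entity", FIELDS, defaults=[None] * len(FIELDS))

def to_entity(parsed):
    """Build an Entity from a dict of header -> value pairs."""
    # An unknown header raises TypeError, which flags gaps in FIELDS early
    return Entity(**parsed)

entity = to_entity({"UI": "T145", "RL": "exhibits", "ABR": "EX"})
print(entity.UI, entity.RL, entity.STY)
```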

Python - Reading data from a file with variable attributes and line lengths
