我正在尝试找到解析Python文件的最佳方法,并创建一个namedtuples列表,每个元组代表一个数据实体及其属性。数据看起来像这样:
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]
UI: etc...
虽然共享了几个属性(例如UI),但有些属性不是(例如STY)。但是,我可以硬编码必要的详尽清单 由于每个分组都由空行分隔,因此我使用了分割,因此我可以单独处理每个数据块:
input = file.read().split("\n\n")
for chunk in input:
process(chunk)
我已经看到一些方法使用字符串查找/拼接,itertools.groupby,甚至正则表达式。我正在考虑使用'[AZ] *:'的正则表达式来查找标题的位置,但我不确定如何在之后拉出多行直到达到另一个标题(例如DEF之后的多行数据)第一个示例实体)。
我感谢任何建议。
答案 0 :(得分:2)
source = """
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
"""
inpt = source.split("\n") #just emulating file
import re
reg = re.compile(r"^([A-Z]{2,3}):(.*)$")
output = dict()
current_key = None
current = ""
for line in inpt:
line_match = reg.match(line) #check if we hit the CODE: Content line
if line_match is not None:
if current_key is not None:
output[current_key] = current #if so - update the current_key with contents
current_key = line_match.group(1)
current = line_match.group(2)
else:
current = current + line #if it's not - it should be the continuation of previous key line
output[current_key] = current #don't forget the last guy
print(output)
答案 1 :(得分:2)
我假设如果你在多行上有字符串跨度,你想要用空格替换换行符(并删除任何其他空格)。
def process_file(filename):
reg = re.compile(r'([\w]{2,3}):\s') # Matches line header
tmp = '' # Stored/cached data for mutliline string
key = None # Current key
data = {}
with open(filename,'r') as f:
for row in f:
row = row.rstrip()
match = reg.match(row)
# Matches header or is end, put string to list:
if (match or not row) and key:
data[key] = tmp
key = None
tmp = ''
# Empty row, next dataset
if not row:
# Prevent empty returns
if data:
yield data
data = {}
continue
# We do have header
if match:
key = str(match.group(1))
tmp = row[len(match.group(0)):]
continue
# No header, just append string -> here goes assumption that you want to
# remove newlines, trailing spaces and replace them with one single space
tmp += ' ' + row
# Missed row?
if key:
data[key] = tmp
# Missed group?
if data:
yield data
此生成器在每次迭代中返回dict
对UI: T020
(并且始终至少有一个项目)。
由于它使用生成器和连续读取,因此它应该是大文件上的有效事件,并且不会立即将整个文件读入内存。
这是一个小小的演示:
for data in process_file('data.txt'):
print('-'*20)
for i in data:
print('%s:'%(i), data[i])
print()
实际输出:
--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab
--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
答案 2 :(得分:0)
import re
from collections import namedtuple
def process(chunk):
split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
d = dict()
fields = list()
for i in xrange(len(split_chunk)/2):
fields.append(split_chunk[i])
d[split_chunk[i]] = split_chunk[i+1]
my_tuple = namedtuple(split_chunk[1], fields)
return my_tuple(**d)
应该这样做。我想我只是做了dict
- 为什么你如此依恋namedtuple
?