I have a txt file with the following format:
Intestinal infectious diseases (001-003)
001 Cholera
002 Fever
003 Salmonella
Zoonotic bacterial diseases (020-022)
020 Plague
021 Tularemia
022 Anthrax
External Cause Status (E000)
E000 External cause status
Activity (E001-E002)
E001 Activities involving x and y
E002 Other activities
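Each line that starts with a 3-digit code, E + a 3-digit code, or V + a 3-digit code is a value belonging to the preceding header, and the headers are the keys of my dictionary. In other questions I have seen, each line could be split on a column or a colon to create the key/value pairs, but the format of my txt file does not allow that.
Is there a way to turn a txt file like this into a dictionary in which the keys are the group names and the values are the code + disease name?
I also need to parse the code and the disease name into a second dictionary, so that I end up with a dictionary whose keys are the group names and whose values are second dictionaries, with the codes as keys and the disease names as values.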
def process_file(filename):
    myDict={}
    f = open(filename, 'r')
    for line in f:
        if line[0] is not int:
            if line.startswith("E"):
                if line[1] is int:
                    line = dictionary1_values
                else:
                    break
            else:
                line = dictionary1_key
    myDict[dictionary1_key].append[line]
The desired output format is:
{"Intestinal infectious diseases (001-003)": {"001": "Cholera", "002": "Fever", "003": "Salmonella"}, "Zoonotic bacterial diseases (020-022)": {"020": "Plague", "021": "Tularemia", "022": "Anthrax"}, "External Cause Status (E000)": {"E000": "External cause status"}, "Activity (E001-E002)": {"E001": "Activities involving x and y", "E002": "Other activities"}}
Answer 0 (score: 0)
Try using a regular expression to decide whether a line is a header or a disease entry:
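import re

mydict = {}
# `filename` is the path to the txt file, as in the question's process_file(filename)
with open(filename, "r") as f:
    header = None
    for line in f:
        line = line.rstrip()  # drop the trailing newline so it does not end up in the keys
        # An entry is an optional E or V, three digits, a space, then the disease name
        match_disease = re.match(r"([EV]?\d{3}) (.*)", line)
        if not match_disease:
            # Any other line is a header: start a new, empty group for it
            header = line
            mydict[header] = {}
        else:
            code = match_disease.group(1)
            disease = match_disease.group(2)
            mydict[header][code] = disease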
Answer 1 (score: 0)
One solution is to use regular expressions to characterize and parse the two types of lines you may encounter in this file:
import re
header_re = re.compile(r'([\w\s]+) \(([\w\s\-]+)\)')
entry_re = re.compile(r'([EV]?\d{3}) (.+)')
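To see what the two patterns capture, here is a small sketch that applies them to a few lines taken from the sample file in the question. The `classify` helper and the `samples` list are names made up for this illustration and are not part of the original answer.

```python
import re

header_re = re.compile(r'([\w\s]+) \(([\w\s\-]+)\)')
entry_re = re.compile(r'([EV]?\d{3}) (.+)')

def classify(line):
    """Return ('header', name, codes), ('entry', code, name) or ('unknown', line)."""
    header = header_re.match(line)
    if header:
        return ('header',) + header.groups()
    entry = entry_re.match(line)
    if entry:
        return ('entry',) + entry.groups()
    return ('unknown', line)

# Hypothetical sample lines, copied from the question's file format
samples = [
    "Intestinal infectious diseases (001-003)",
    "001 Cholera",
    "Activity (E001-E002)",
    "E001 Activities involving x and y",
]
for line in samples:
    print(classify(line))
```

The header pattern is tried first because an entry line never contains the parenthesized code range, so it cannot be mistaken for a header in this sample.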
This lets you check very easily which type of line you have encountered and split it up as needed:
# Check if a line is a header:
header = header_re.match(line)
if header:
    header_name, header_codes = header.groups()  # e.g. ('Intestinal infectious diseases', '001-009')
    # Do whatever you need to do when you encounter a new group
    # ...
else:
    entry = entry_re.match(line)
    # If the line wasn't a header, it ought to be an entry,
    # otherwise we've encountered something we didn't expect
    assert entry is not None
    entry_number, entry_name = entry.groups()  # e.g. ('001', 'Cholera')
    # Do whatever you need to do when you encounter an entry in a group
    # ...
Using this to rewrite your function, we can write the following:
import re

def process_file(filename):
    header_re = re.compile(r'([\w\s]+) \(([\w\s\-]+)\)')
    entry_re = re.compile(r'([EV]?\d{3}) (.+)')

    all_groups = {}
    current_group = None

    with open(filename, 'r') as f:
        for line in f:
            # Check if a line is a header:
            header = header_re.match(line)
            if header:
                current_group = {}
                all_groups[header.group(0)] = current_group
            else:
                entry = entry_re.match(line)
                # If the line wasn't a header, it ought to be an entry,
                # otherwise we've encountered something we didn't expect
                assert entry is not None
                entry_number, entry_name = entry.groups()  # e.g. ('001', 'Cholera')
                current_group[entry_number] = entry_name

    return all_groups
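To try the same logic without a file on disk, a variant can accept any iterable of lines. The sketch below is an assumption-laden illustration rather than part of the original answer: `parse_groups` is a hypothetical name, the sample text is an excerpt of the question's file, and unexpected lines raise a `ValueError` instead of tripping an `assert`.

```python
import io
import re

header_re = re.compile(r'([\w\s]+) \(([\w\s\-]+)\)')
entry_re = re.compile(r'([EV]?\d{3}) (.+)')

def parse_groups(lines):
    """Same logic as process_file, but for any iterable of lines."""
    all_groups = {}
    current_group = None
    for line in lines:
        header = header_re.match(line)
        if header:
            # group(0) is the full matched header text, without the newline
            current_group = {}
            all_groups[header.group(0)] = current_group
        else:
            entry = entry_re.match(line)
            # Reject lines that are neither header nor entry,
            # and entries that appear before the first header
            if entry is None or current_group is None:
                raise ValueError("unexpected line: %r" % line)
            code, name = entry.groups()
            current_group[code] = name
    return all_groups

# In-memory stand-in for the txt file (excerpt of the question's sample)
sample = io.StringIO(
    "Intestinal infectious diseases (001-003)\n"
    "001 Cholera\n"
    "002 Fever\n"
    "003 Salmonella\n"
    "External Cause Status (E000)\n"
    "E000 External cause status\n"
)
groups = parse_groups(sample)
print(groups["Intestinal infectious diseases (001-003)"]["002"])  # Fever
print(groups["External Cause Status (E000)"])
```

Because an open file is itself an iterable of lines, `parse_groups(open(filename))` would behave like the file-based version.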
Answer 2 (score: 0)
def process_file(filename):
    myDict = {}
    rootkey = None
    with open(filename, 'r') as f:
        for line in f:
            line = line.rstrip()  # drop the trailing newline and any other trailing whitespace
            if line[1:3].isdigit():
                # The second and third characters are digits ("0".."9"), so this is an entry line.
                # Split once: the code (with or without a leading "E"), then the name.
                subkey, data = line.split(" ", 1)
                myDict[rootkey][subkey] = data
            else:
                # Any other line is a header: prepare a new, empty group for it
                rootkey = line
                myDict[rootkey] = {}
    return myDict
Testing in the Python console:
>>> d = process_file('/tmp/file.txt')
>>>
>>> d['Intestinal infectious diseases (001-003)']
{'003': 'Salmonella', '002': 'Fever', '001': 'Cholera'}
>>> d['Intestinal infectious diseases (001-003)']['002']
'Fever'
>>> d['Activity (E001-E002)']
{'E001': 'Activities involving x and y', 'E002': 'Other activities'}
>>> d['Activity (E001-E002)']['E001']
'Activities involving x and y'
>>>
>>> d
{'Activity (E001-E002)': {'E001': 'Activities involving x and y', 'E002': 'Other activities'}, 'External Cause Status (E000)': {'E000': 'External cause status'}, 'Intestinal infectious diseases (001-003)': {'003': 'Salmonella', '002': 'Fever', '001': 'Cholera'}, 'Zoonotic bacterial diseases (020-022)': {'021': 'Tularemia', '020': 'Plague', '022': 'Anthrax'}}
Warning: the first line of the file must be a "rootkey" (header) line, not a "subkey" or data line! Otherwise a KeyError is raised :-)
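If the input cannot be trusted to start with a header, the same string-method approach can fail with a clearer message. This is only a sketch under assumptions: `parse_with_guard` is a hypothetical name, it takes an iterable of lines rather than a filename, and skipping blank lines is an added choice that the answer above does not make.

```python
def parse_with_guard(lines):
    """Build {header: {code: name}} and reject entries that precede any header."""
    result = {}
    rootkey = None
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line[1:3].isdigit():
            if rootkey is None:
                raise ValueError("entry %r appears before any header" % line)
            subkey, data = line.split(" ", 1)
            result[rootkey][subkey] = data
        else:
            rootkey = line
            result[rootkey] = {}
    return result

# A well-formed input parses as usual
print(parse_with_guard(["Activity (E001-E002)\n", "E002 Other activities\n"]))

# An input that starts with an entry is rejected with a readable message
try:
    parse_with_guard(["001 Cholera\n"])
except ValueError as exc:
    print("rejected:", exc)
```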
Note: maybe you should strip the leading "E" character. Or is that not an option? Do you need to keep this "E" character somewhere?
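The desired output in the question keeps keys such as "E000", so the letter probably has to stay. If it ever needs to be handled separately, one option is to split it off instead of discarding it. A minimal sketch, where `split_code` is a hypothetical helper name:

```python
def split_code(code):
    """Split 'E001' into ('E', '001'); a plain '001' becomes ('', '001')."""
    if code[:1] in ("E", "V"):
        return code[0], code[1:]
    return "", code

print(split_code("E001"))
print(split_code("001"))
```

Keeping the prefix as a separate value avoids losing the distinction between a code like "E001" and a plain "001" if the two ever end up in the same dictionary.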