I have a file containing the following data:
line EF1 1 F Flu 5.7 3.221 9.332
line A2 1 C Car 3.2 5.22 1.22
line A1 1 C Car 3.11 4.21 2.13
line HF1 1 H Hyd 7.11 5.11 7.11
line EE2 1 F Flu 5.7 3.221 9.332
line A2 2 C Car 3.2 5.22 1.22
line EF1 2 F Flu 5.7 3.221 9.332
line EE2 2 F Flu 5.7 3.221 9.332
line A1 2 C Car 3.11 4.21 2.13
line HE2 2 H Hyd 7.11 5.11 7.11
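...... more than 1000 lines.
Here the 3rd column is the chain number.
I have now created different lists named EF, EE, H and ace.
What I want to do is this: if EF1 and HE1 both come from the same chain number, write the EF data to the 'EF list' and the HE data to the 'H list'. On the other hand, if only 'EF1' is present in a chain number and HE1 is not, write it to the 'ace list'.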
The desired output is:
EF list: line EF1 1 F Flu 5.7 3.221 9.332
line EE2 2 F Flu 5.7 3.221 9.332
H list: line HF1 1 H Hyd 7.11 5.11 7.11
line HE2 2 H Hyd 7.11 5.11 7.11
ace list: line EE2 1 F Flu 5.7 3.221 9.332
line EF1 2 F Flu 5.7 3.221 9.332
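This is what I have tried so far:
inp = filename.read().strip().split('\n')
for line in map(str.split,inp):
    codeName = line[1]
    shortName = line[3]
As a newbie I am really lost here: how can I build an if condition inside the loop to perform this check?
Please give me some ideas on how to make progress with this!
(I made a formatting mistake the first time. Corrected it!)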
Answer 0 (score: 1)
Your code needs to look more like this:
with open(filename) as inp:
    for line in inp:
        tokens = line.split()
        codeName = tokens[1]
        shortName = tokens[3]
You never opened the file at all, and map() isn't really helping you either.
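As a small illustration of the loop above, here is a minimal sketch that wraps it in a generator and skips blank or truncated lines. The name parse_rows and the io.StringIO stand-in for the file are assumptions made for this example only; they are not part of the original answer.

```python
import io

def parse_rows(lines):
    """Yield (code_name, short_name) for every usable line."""
    for line in lines:
        tokens = line.split()
        if len(tokens) < 4:
            continue  # skip blank or truncated lines
        yield tokens[1], tokens[3]

# io.StringIO stands in for a real file here (an assumption for the demo).
# With a file on disk you would write:
#     with open(filename) as inp:
#         rows = list(parse_rows(inp))
fake_file = io.StringIO(
    "line EF1 1 F Flu 5.7 3.221 9.332\n"
    "\n"
    "line A2 1 C Car 3.2 5.22 1.22\n"
)
rows = list(parse_rows(fake_file))
```

Because the function accepts any iterable of lines, the same code works on an open file object or on a plain list of strings.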
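Answer 1 (score: 0)
I don't think you really want to use a for loop there...
with open(filename) as f:
    inp = f.read().strip().split('\n')
# split each line into columns, dropping the leading "line" token
inp = [line.split()[1:] for line in inp]
# sort the rows numerically by the second column, so that groups appear together in order
inp = sorted(inp, key=lambda cols: int(cols[1]))
Now you have a list of lines, split into columns.
Next you want to split them into blocks of lines whose second column is the same, right?
cur_group = 1
groups_list = []
group = []
for line in inp:
    if int(line[1]) == cur_group:
        group.append(line)
    else:
        # the chain number changed: store the finished group and start a new one
        groups_list.append(group)
        group = [line]
        cur_group = int(line[1])
# store the last group as well
groups_list.append(group)
Now you have a list of groups, each of which is a list of lines; your data looks like this: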
groups_list = [[['EF1', '1', 'F', 'Flu', '5.7', '3.221', '9.332'],
...],
[['A2', '2', ... ],
...],
...
]
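The grouping step can also be sketched with itertools.groupby from the standard library, which splits a sorted sequence into runs that share a key. This is only an alternative sketch, not the code of this answer; the shortened sample rows and the helper name chain_number are made up for illustration.

```python
from itertools import groupby

# Hypothetical, shortened rows: already split into columns,
# without the leading "line" token.
rows = [
    ["EF1", "1", "F"],
    ["A2", "2", "C"],
    ["HF1", "1", "H"],
    ["EE2", "2", "F"],
]

def chain_number(cols):
    """Return the chain number (second column) as an integer."""
    return int(cols[1])

# groupby only merges adjacent items, so the rows must be sorted first.
rows_sorted = sorted(rows, key=chain_number)
groups_list = [list(block) for _, block in groupby(rows_sorted, key=chain_number)]
```

Since sorted is stable, the rows inside each group keep their original relative order.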
Now you can look at what you really want to know, namely whether each EF in each group has a matching EH. I would create a helper function for that:
def find_match(line, group, EH_list, EF_list):
    """
    returns false if no match found, returns true and appends line and match to appropriate lists otherwise
    """
    for pmatch in group:
        if line[0].startswith('EH') and pmatch[0].startswith('EF') and pmatch[0][2]==line[0][2]: # Match case 1
            EH_list.append(line)
            EF_list.append(pmatch)
            return True
        elif line[0].startswith('EF') and pmatch[0].startswith('EH') and pmatch[0][2]==line[0][2]: # Match case 2
            EF_list.append(line)
            EH_list.append(pmatch)
            return True
    # only give up after every line in the group has been checked
    return False
The rest is then simple and fairly straightforward:
EH, EF, ace = [], [], []
for group in groups_list:
    for line in group:
        if line[0][0] == 'E' and not find_match(line, group, EH, EF):
            ace.append(line)
...and I think that should be it! I make no promises that this code will run right away, but it should at least give you a good place to start.
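Answer 2 (score: 0)
You said the order is random, so you cannot tell whether a pair of entries exists just by walking through the file line by line. Instead, you need to remember all unpaired entries, so that when you see the other half you can check whether its partner exists.
So I would keep the unmatched entries in a dictionary, keyed by the Ex/Hx part and the chain number. If, for a given line, the paired entry is in the unmatched dictionary, I can remove it from there and add both entries to the correct lists. Otherwise I add the line itself to the dictionary.
All unmatched entries that have not been consumed by the end automatically become the entries of ace.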
# initialize target lists and unmatched dictionary
E_list, H_list = [], []
unmatched = {}
with open('filename') as f:
    # loop over each line in the file
    for line in f:
        # split the line into parts separated by whitespace
        data = tuple(line.split())
        # `key` is the second column, `chain` the third
        key, chain = data[1], data[2]
        # `key` begins with an 'E'
        if key.startswith('E'):
            # The key of the paired value begins with an 'H' instead of 'E'
            pairKey = 'H' + key[1:]
            # The list for the current item will be `E_list`; the list for
            # the paired element will be `H_list`
            curList, pairList = E_list, H_list
        # `key` begins with an 'H'
        elif key.startswith('H'):
            # The key of the paired value begins with an 'E' instead of 'H'
            pairKey = 'E' + key[1:]
            # The list for the current item will be `H_list`; the list for
            # the paired element will be `E_list`
            curList, pairList = H_list, E_list
        # `key` starts with neither 'E' nor 'H', so skip this line
        else:
            continue
        # At this point we know that the current line has `key` as its
        # key, and `chain` as its chain. The element that should be paired
        # with it has `pairKey` as its key, and also `chain` as its chain.
        # If we have seen the paired element before, it will be in the
        # `unmatched` dictionary; if that’s the case, put the current element
        # into the list `curList`, and the paired element into the list
        # `pairList`.
        # Look up the paired element from the unmatched dictionary
        pair = unmatched.get((pairKey, chain), None)
        if pair:
            # If we found it, append the current and paired element to their
            # correct list …
            curList.append(data)
            pairList.append(pair)
            # … and remove the paired element from the unmatched dictionary
            del unmatched[(pairKey, chain)]
        else:
            # Otherwise, if we didn’t find it, remember this item to be
            # paired with something later
            unmatched[(key, chain)] = data
# Finally, collect all elements that haven’t been matched yet, and
# put them into the `ace` list
ace = list(unmatched.values())
Used with your sample data, this yields:
>>> for l in E_list: print(l)
('line', 'EF1', '1', 'F', 'Flu', '5.7', '3.221', '9.332')
('line', 'EE2', '2', 'F', 'Flu', '5.7', '3.221', '9.332')
>>> for l in H_list: print(l)
('line', 'HF1', '1', 'H', 'Hyd', '7.11', '5.11', '7.11')
('line', 'HE2', '2', 'H', 'Hyd', '7.11', '5.11', '7.11')
>>> for l in ace: print(l)
('line', 'EF1', '2', 'F', 'Flu', '5.7', '3.221', '9.332')
('line', 'EE2', '1', 'F', 'Flu', '5.7', '3.221', '9.332')
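One thing I am not sure about from your description is whether the "key" (EF/EH + number) and the chain number are enough to uniquely identify the pair. If that is not the case, you may want to change the data I use as the dictionary key.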
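If the same key and chain number can occur more than once, one possible adjustment (an assumption about how duplicates should behave, not something the question specifies) is to keep a list of waiting rows per dictionary key instead of a single row. The sketch below pairs rows first come, first served; the function name pair_records and the sample list are hypothetical.

```python
from collections import defaultdict

def pair_records(lines):
    """Pair E*/H* rows that share the same suffix and chain number."""
    e_list, h_list = [], []
    unmatched = defaultdict(list)  # (key, chain) -> rows still waiting for a partner
    for line in lines:
        data = tuple(line.split())
        if len(data) < 3:
            continue  # skip blank or truncated lines
        key, chain = data[1], data[2]
        if key.startswith("E"):
            pair_key, cur_list, pair_list = "H" + key[1:], e_list, h_list
        elif key.startswith("H"):
            pair_key, cur_list, pair_list = "E" + key[1:], h_list, e_list
        else:
            continue
        # .get() does not create an entry for a missing key
        waiting = unmatched.get((pair_key, chain))
        if waiting:
            cur_list.append(data)
            pair_list.append(waiting.pop(0))  # oldest waiting partner first
        else:
            unmatched[(key, chain)].append(data)
    # whatever is still waiting at the end has no partner
    ace = [row for rows in unmatched.values() for row in rows]
    return e_list, h_list, ace

sample = [
    "line EF1 1 F Flu 5.7 3.221 9.332",
    "line A2 1 C Car 3.2 5.22 1.22",
    "line HF1 1 H Hyd 7.11 5.11 7.11",
    "line EE2 1 F Flu 5.7 3.221 9.332",
    "line EF1 1 F Flu 5.7 3.221 9.332",  # duplicate key and chain
]
e_rows, h_rows, ace_rows = pair_records(sample)
```

With this sample, the first EF1 row of chain 1 is paired with HF1, while the duplicate EF1 row and the EE2 row end up in ace_rows.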
Answer 3 (score: 0)
I think the key part is how to identify which lines form a pair. I used the following snippet for that:
splitline = line.strip().split()
identifier = "".join(splitline[1:3])[1:]
# You could also write the following, if that makes it more clear:
identifier = splitline[1][1:] + splitline[2]
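Basically, what "".join(splitline[1:3])[1:] does is build the identifier string F11 from HF1 1 (it drops the first character). This identifier is what has to be tested: does it occur only for an "F" line (or only for an "H" line), or for both?
In the example, EF1 with chain 1 and HF1 with chain 1 both produce the identifier F11. The first time one of them is seen, the code sets categories['F11'] = 1. When its partner is found, it sets categories['F11'] = 2.
Using this, we build a dictionary that stores, for each identifier, whether it occurred once or twice.
If the value of an identifier in categories is 1, we know the line should go into ace; if the value is 2, we know the entry should be written to F or H.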
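To make the identifier idea concrete, here is a minimal sketch that computes the identifiers for the sample lines from the question and counts them with collections.Counter. The helper name identifier is an assumption for this example, and unlike the dictionary described above, a Counter keeps counting past 2 if an identifier shows up more often.

```python
from collections import Counter

def identifier(line):
    """Drop the first character of the code name and append the chain number."""
    parts = line.split()
    return parts[1][1:] + parts[2]

sample = [
    "line EF1 1 F Flu 5.7 3.221 9.332",
    "line A2 1 C Car 3.2 5.22 1.22",
    "line A1 1 C Car 3.11 4.21 2.13",
    "line HF1 1 H Hyd 7.11 5.11 7.11",
    "line EE2 1 F Flu 5.7 3.221 9.332",
    "line A2 2 C Car 3.2 5.22 1.22",
    "line EF1 2 F Flu 5.7 3.221 9.332",
    "line EE2 2 F Flu 5.7 3.221 9.332",
    "line A1 2 C Car 3.11 4.21 2.13",
    "line HE2 2 H Hyd 7.11 5.11 7.11",
]

# Count how often each identifier occurs in the sample.
categories = Counter(identifier(line) for line in sample)
```

Here categories["F11"] and categories["E22"] are 2 (a pair was found), while categories["E21"] and categories["F12"] are 1 (candidates for ace).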
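Because we use a dictionary, this solution is fast; if you want to preserve the order in the lists, let me know and I can update the answer accordingly.
So here is the code: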
infile = "overflow.txt"
result = {"F" : [], "H" : [], "ace" : []}
with open(infile) as f:
    # First pass through to build the dictionary of identifiers
    categories = {}
    for line in f:
        splitline = line.strip().split()
        identifier = "".join(splitline[1:3])[1:]
        if identifier not in categories:
            categories[identifier] = 1
        else:
            # If the identifier is already there, the value becomes 2.
            categories[identifier] = 2
    # To go for a second pass through to create the lists
    f.seek(0)
    for line in f:
        splitline = line.strip().split()
        identifier = "".join(splitline[1:3])[1:]
        if categories[identifier] == 1: # meaning the stem just occurs once
            if splitline[3] != "C":
                result["ace"].append(line.strip())
        else:
            result[splitline[3]].append(line.strip())
You can now access these groups via result["H"], result["F"] and result["ace"].
Here is the code to print the results:
for kind in result:
    print("\n", kind, "list:", "\n", "\n ".join(result[kind]))
H list:
line HF1 1 H Hyd 7.11 5.11 7.11
line HE2 2 H Hyd 7.11 5.11 7.11
F list:
line EF1 1 F Flu 5.7 3.221 9.332
line EE2 2 F Flu 5.7 3.221 9.332
ace list:
line EE2 1 F Flu 5.7 3.221 9.332
line EF1 2 F Flu 5.7 3.221 9.332