I have a tab-separated file with 100k rows: (edited)
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
I want the IDs that match in both directions AND form a 1A-1B / 1B-1A pair saved in one file, and the rest saved in another file, so:
out.txt)
PROT1B2 PROT1A1
PROT1A1 PROT1B2

rest.txt)
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
My script gives me the bidirectional IDs, but I don't know how to match the specific pattern; maybe with re? I would appreciate comments in your script so I can understand and modify it.
fileA = open("input.txt", 'r')
fileB = open("input_copy.txt", 'r')
output = open("out.txt", 'w')
out2 = open("rest.txt", 'w')

dictA = dict()
for line1 in fileA:
    new_list = line1.rstrip('\n').split('\t')
    query = new_list[0]
    subject = new_list[1]
    dictA[query] = subject

dictB = dict()
for line1 in fileB:
    new_list = line1.rstrip('\n').split('\t')
    query = new_list[0]
    subject = new_list[1]
    dictB[query] = subject

SharedPairs = {}
NotSharedPairs = {}
for id1 in dictA.keys():
    value1 = dictA[id1]
    if value1 in dictB.keys():
        if id1 == dictB[value1]:  # maybe re should go here?
            SharedPairs[value1] = id1
        else:
            NotSharedPairs[value1] = id1

for key in SharedPairs.keys():
    line = key + '\t' + SharedPairs[key] + '\n'
    output.write(line)

for key in NotSharedPairs.keys():
    line = key + '\t' + NotSharedPairs[key] + '\n'
    out2.write(line)
Answer (score: 1)
For the final specification, here is the suggested solution:
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in context secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            a_dict[pair[0]] = pair[1]
            b_dict[pair[1]] = pair[0]

must_match = sorted(('1A', '1B'))  # accept only unordered 1A-1B pairs
fix_pos_slice = slice(4, 6)  # sample: 'PROT1A1' has '1A' at slice(4, 6)
shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    # Below we prepare a candidate for matching against the unordered pair
    fix_pos_match_cand = sorted(x[fix_pos_slice] for x in (key_a, val_a))
    if must_match == fix_pos_match_cand and b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

# Output shared and rest into corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # again secured in a context block
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')
Operating on the given (new) input.txt (tabs expected between the words of a line!):
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
yields out.txt:
PROT1A1 PROT1B2
PROT1B2 PROT1A1
and rest.txt:
PROT1B6 PROT1A5
PROT2B2 PROT2A1
PROT3B2 PROT1A2
PROT1A2 PROT3B2
Comments were added to highlight some parts of the code.
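The fixed-position trick can be checked in isolation; a minimal sketch of how `slice(4, 6)` plus `sorted` turn a pair of IDs into an order-free match candidate:

```python
# In IDs like 'PROT1A1' the two-character code ('1A') occupies string
# positions 4 and 5; sorting the two codes makes the pair order-free.
fix_pos_slice = slice(4, 6)
pair = ('PROT1B2', 'PROT1A1')
codes = sorted(x[fix_pos_slice] for x in pair)
print(codes)  # ['1A', '1B']
```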
To reuse the same input file (but demonstrate different results), adding a hypothetical 1A-3B pair to the allowed matches (2A-2B is absent from the input anyway) looks like this (one solution):
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in context secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            a_dict[pair[0]] = pair[1]
            b_dict[pair[1]] = pair[0]

must_match_once = sorted(  # accept only unordered 1A-1B or 1A-3B pairs
    (sorted(pair) for pair in (('1A', '1B'), ('1A', '3B'))))
fix_pos_slice = slice(4, 6)  # sample: 'PROT1A1' has '1A' at slice(4, 6)
shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    # Below we prepare a candidate for matching against the unordered pairs
    fix_pos_match_cand = sorted(x[fix_pos_slice] for x in (key_a, val_a))
    has_match = any(
        must_match == fix_pos_match_cand for must_match in must_match_once)
    if has_match and b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

# Output shared and rest into corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # again secured in a context block
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')
Operating on the given (same) input.txt (note that tabs are still expected between the words of a line!):
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
yields out.txt:
PROT1A1 PROT1B2
PROT1B2 PROT1A1
PROT3B2 PROT1A2
PROT1A2 PROT3B2
and rest.txt:
PROT1B6 PROT1A5
PROT2B2 PROT2A1
As usual, real life introduces "duplicates", so per the OP's special request, here is a final variant that handles duplicated first- or second-column tokens.
One could also store the whole pairs as strings in a set, but keeping dicts of the input (using the handy setdefault method and targeting lists as values) is IMO the better-suited approach.
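A minimal sketch of the setdefault idiom mentioned above, fed with the duplicate sample shown below:

```python
# setdefault inserts an empty list the first time a key is seen and
# always returns that list, so values for a repeated key accumulate
# instead of overwriting one another.
a_dict = {}
for first, second in [('PROT1A1', 'PROT1B1'), ('PROT1A1', 'PROT2B1')]:
    a_dict.setdefault(first, []).append(second)
print(a_dict)  # {'PROT1A1': ['PROT1B1', 'PROT2B1']}
```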
In the variant below, two further things differ from the other variants.
Sample of the (partially) duplicated data handled:
PROT1A1 PROT1B1
PROT1A1 PROT2B1
The source code:
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in context secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            # Build dicts with lists as values
            # ... to keep same-key, different-value pairs
            a_dict.setdefault(pair[0], []).append(pair[1])
            b_dict.setdefault(pair[1], []).append(pair[0])

must_match_once = sorted(  # accept only unordered 1A-1B or 1A-3B pairs
    (sorted(pair) for pair in (('1A', '1B'), ('1A', '3B'))))
fix_pos_slice = slice(4, 6)  # sample: 'PROT1A1' has '1A' at slice(4, 6)
shared, rest = [], []  # store respective output in lists of pairs (tuples)
for key_a, seq_val_a in a_dict.items():
    for val_a in seq_val_a:
        # Below we prepare a candidate for matching against the unordered pairs
        fix_pos_mc = sorted(x[fix_pos_slice] for x in (key_a, val_a))
        has_match = any(
            must_match == fix_pos_mc for must_match in must_match_once)
        if has_match and b_dict.get(key_a) and val_a in b_dict.get(key_a):
            # Preserve first, second source order by appending in that order
            shared.append((key_a, val_a))
        else:
            rest.append((key_a, val_a))

# Output shared and rest into corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # again secured in a context block
        for pair in data:
            o_f.write('\t'.join(pair) + '\n')
Operating on the given (degenerate) input.txt (note that tabs are still expected between the words of a line!):
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A1 PROT2B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
yields out.txt:
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
and rest.txt:
PROT1A1 PROT2B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
Thus the "duplicate" PROT1A1 PROT2B2 is preserved and does not affect the searched-for PROT1A1 PROT1B2.
Previous update:
With the task now better specified through the comments (maybe the question can also be updated):
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
fix_pos_match_query = sorted(('1A', '1B'))
fix_pos_slice = slice(4, 6)  # sample 'PROT1A1' has '1A' at slice(4, 6)
with open(a_file_name, "rt") as i_f:
    for line in i_f:
        line = line.rstrip('\n')
        if line:
            pair = [a.strip() for a in line.split('\t')]
            fix_pos_match_cand = sorted(x[fix_pos_slice] for x in pair)
            if fix_pos_match_query == fix_pos_match_cand:
                a_dict[pair[0]] = pair[1]
                b_dict[pair[1]] = pair[0]

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    if b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')
Operating on the given input.txt:
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
yields out.txt:
PROT1A1 PROT1B2
PROT1B2 PROT1A1
and rest.txt:
PROT1B6 PROT1A5
This works by expecting the given 2 characters at the exact position understood from the sample and the comments.
If this is a protein-domain-specific task with further variability (e.g. no fixed-position match), a regex would certainly be better, but the question does not present such variability. If needed, just replace the filter on the input lines with one that matches (or does not match) a regular expression.
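Such a regex-based filter could look like the sketch below; `pair_accepted` and the pattern are illustrative names (not part of the code above), assuming the code is a digit followed by a letter right after the 'PROT' prefix:

```python
import re

# Hypothetical regex filter replacing the fixed-position slice:
# extract the digit+letter code after 'PROT' and accept the pair
# in either order.
id_code = re.compile(r'^PROT(\d[A-Z])\d+$')

def pair_accepted(id1, id2, accept=('1A', '1B')):
    matches = [id_code.match(x) for x in (id1, id2)]
    if not all(matches):
        return False  # a line that does not fit the ID shape at all
    return sorted(m.group(1) for m in matches) == sorted(accept)

print(pair_accepted('PROT1B2', 'PROT1A1'))  # True
print(pair_accepted('PROT2A1', 'PROT2B2'))  # False
```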
Old:
First answer, using a simple exact-match filter:
A trial of an answer (or at least some hints on how the code might be written so others can read it), with a simple filter based on the presence of the specific pair requested in the input:
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
filter_on_read = sorted(('PROT1B2', 'PROT1A1'))
with open(a_file_name, "rt") as i_f:
    for line in i_f:
        line = line.rstrip('\n')
        if line:
            pair = [a.strip() for a in line.split('\t')]
            if filter_on_read == sorted(pair):
                a_dict[pair[0]] = pair[1]
                b_dict[pair[1]] = pair[0]

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    if b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')
Run on my machine against:
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
it yields out.txt:
PROT1A1 PROT1B2
PROT1B2 PROT1A1
and nothing in rest.txt (for this input), because the rest was already filtered out on read.
Please note: there will surely be more elegant versions. When reading huge files we might even lose data: because we only loop over one side of the mapping (from first to second entry), just like the question's code, a duplicated first entry with a different second entry overwrites the previously read data, and the earlier pair is never output. Entries in b_dict may therefore never appear in the output files.
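A two-line sketch of the overwrite the note above warns about:

```python
# Plain assignment keeps only the last value seen for a key,
# so the first pair silently disappears:
d = {}
for first, second in [('PROT1A1', 'PROT1B1'), ('PROT1A1', 'PROT2B1')]:
    d[first] = second
print(d)  # {'PROT1A1': 'PROT2B1'}
```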
HTH. Python is a great language ;-)