I want to compare 7 different files for identical rows and show the entries that appear in more than one file. For example:
file1:
ID123 columns with info
ID456 columns with info
ID789 columns with info
file2:
ID123 columns with info
ID999 columns with info
ID888 columns with info
file3:
ID999 columns with info
ID123 columns with info
ID555 columns with info
Then I would like to print/display something like this:
file1 and file2 and file3: ID123
file2 and file3: ID999, ID123
I already have something like this:
with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)
same.discard('\n')
with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)
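But in this case I want to compare 7 files. Also, the files are tab-delimited, so I want to compare only the first column of each file and note the duplicates. I think I need something like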
for i in excelList[1:]:
    newlist.append(i.split("\t")[0])
or something like that. But even if I build 7 lists, comparing them all with ".intersection" would still be awkward.
Is there an easier way to achieve this?
Answer 0 (score: 1)
You can use a dict that maps each ID to a list of the filenames it appears in:
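import csv
from collections import defaultdict

# filenames: list of paths to the tab-separated files to compare
id_to_files = defaultdict(list)
for filename in filenames:
    # newline="" is what the csv module expects in Python 3
    with open(filename, newline="") as f:
        reader = csv.reader(f, delimiter="\t")  # add further csv options if your files need them
        for row in reader:
            row_id = row[0]  # the first column holds the ID
            id_to_files[row_id].append(filename)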
So you would get something like:
print(id_to_files)
{
    "ID123": ["file1", "file2", "file3"],
    "ID999": ["file2", "file3"],
    "ID888": ["file2"],
    "ID555": ["file3"],
    "ID456": ["file1"],
    "ID789": ["file1"],
}
You can then filter out the entries that list only a single file (since those are not duplicates):
duplicates = {k: v for k, v in id_to_files.items() if len(v) > 1}
print(duplicates)
{
    "ID123": ["file1", "file2", "file3"],
    "ID999": ["file2", "file3"],
}
Then, depending on the exact output you want, you may need to build a second mapping that better fits the output format, for example a reverse mapping:
revduplicates = defaultdict(list)
for k, v in duplicates.items():
    revduplicates[tuple(v)].append(k)
print(revduplicates)
{
    ('file1', 'file2', 'file3'): ['ID123'],
    ('file2', 'file3'): ['ID999'],
}
You will need a few more steps to get the exact output you described, but this should at least get you started.
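As a rough illustration of those remaining steps, here is a minimal Python 3 sketch that ties the pieces together and produces lines in the "file1 and file2: ID123" style. The helper names group_shared_ids and format_groups are made up for this example, and it assumes the ID is the first tab-separated column of every row. It collects filenames in a set rather than a list, so an ID repeated inside one file is not mistaken for a cross-file duplicate. Note that it groups each ID under the exact combination of files containing it, so ID123 is listed only under the three-file group and not again under "file2 and file3" as in the question's sample output.

```python
import csv
from collections import defaultdict


def group_shared_ids(filenames):
    """Map each combination of files to the sorted IDs found in exactly those files."""
    # A set per ID, so an ID repeated within one file is counted once.
    id_to_files = defaultdict(set)
    for filename in filenames:
        with open(filename, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if row:  # skip blank lines
                    id_to_files[row[0]].add(filename)

    groups = defaultdict(list)
    for row_id, files in id_to_files.items():
        if len(files) > 1:  # keep only IDs seen in more than one file
            groups[tuple(sorted(files))].append(row_id)
    return {files: sorted(ids) for files, ids in groups.items()}


def format_groups(groups):
    """Render one line per file combination, largest combinations first."""
    ordered = sorted(groups.items(), key=lambda item: (-len(item[0]), item[0]))
    return [" and ".join(files) + ": " + ", ".join(ids) for files, ids in ordered]
```

To use it, pass the list of your seven paths to group_shared_ids and print each string returned by format_groups on the result. If you do want every ID repeated under each smaller subset of files as well, that would need an extra pass over the subsets of each file combination (itertools.combinations could help there).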