I have three text files:
fileA:
13 abc
123 def
234 ghi
1234 jkl
12 mno
fileB:
12 abc
12 def
34 qwe
43 rty
45 mno
fileC:
12 abc
34 sdg
43 yui
54 poi
54 def
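I want to see whether the values in the second column match between the files. The code below works if the second column is already sorted. But if the second column is not sorted, how do I sort it and compare the files?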
fileA = open("A.txt",'r')
fileB = open("B.txt",'r')
fileC = open("C.txt",'r')

listA1 = []
for line1 in fileA:
    listA = line1.split('\t')
    listA1.append(listA)

listB1 = []
for line1 in fileB:
    listB = line1.split('\t')
    listB1.append(listB)

listC1 = []
for line1 in fileC:
    listC = line1.split('\t')
    listC1.append(listC)

for key1 in listA1:
    for key2 in listB1:
        for key3 in listC1:
            if key1[1] == key2[1] and key2[1] == key3[1] and key3[1] == key1[1]:
                print "Common between three files:",key1[1]

print "Common between file1 and file2 files:"
for key1 in listA1:
    for key2 in listB1:
        if key1[1] == key2[1]:
            print key1[1]

print "Common between file1 and file3 files:"
for key1 in listA1:
    for key2 in listC1:
        if key1[1] == key2[1]:
            print key1[1]
Answer 0 (score: 3)
If you just want to sort listA1, listB1, and listC1 by the second column, that's easy:
import operator
listA1.sort(key=operator.itemgetter(1))
If you don't understand itemgetter, this is equivalent:
listA1.sort(key=lambda element: element[1])
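As a rough illustration of the sort, here is a minimal sketch in Python 3 syntax. It uses the rows of fileA from the question as inline data instead of reading a file; the names rows and second_column are illustrative and not part of the original code.

```python
import operator

# Rows of fileA from the question, already split into [number, label] columns.
rows = [["13", "abc"], ["123", "def"], ["234", "ghi"], ["1234", "jkl"], ["12", "mno"]]

# Reverse the order so that the sort has something to do.
rows.reverse()

# Sort in place by the second column (index 1).
rows.sort(key=operator.itemgetter(1))

second_column = [row[1] for row in rows]
print(second_column)
```

Note that the values are compared as strings, which is what you want for labels such as abc or def.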
However, I think a better solution is to use a set:
setA1 = set(element[1] for element in listA1)
setB1 = set(element[1] for element in listB1)
setC1 = set(element[1] for element in listC1)
Or, more simply, don't build the lists in the first place; do this instead:
setA1 = set()
for line1 in fileA:
    listA = line1.split('\t')
    setA1.add(listA[1])
Either way:
print "Common between file1 and file2 files:"
for key in setA1 & setB1:
    print key
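To make the set approach concrete, here is a small sketch in Python 3 syntax that hard-codes the second-column values of the three sample files from the question. The names common_ab, common_ac, and common_all are illustrative only.

```python
# Second-column values from the three sample files in the question.
setA1 = {"abc", "def", "ghi", "jkl", "mno"}
setB1 = {"abc", "def", "qwe", "rty", "mno"}
setC1 = {"abc", "sdg", "yui", "poi", "def"}

# The & operator computes the intersection of two sets.
common_ab = setA1 & setB1
common_ac = setA1 & setC1
common_all = setA1 & setB1 & setC1

# Sets are unordered, so sort before printing to get a stable order.
print("Common between file1 and file2:", sorted(common_ab))
print("Common between file1 and file3:", sorted(common_ac))
print("Common between all three files:", sorted(common_all))
```

Because sets ignore order, it no longer matters whether the second column of any file is sorted.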
To simplify further, you may want to start by refactoring the repeated code into a function:
def read_file(path):
    with open(path) as f:
        result = set()
        for line in f:
            columns = line.split('\t')
            result.add(columns[1])
    return result

setA1 = read_file('A.txt')
setB1 = read_file('B.txt')
setC1 = read_file('C.txt')
Then you can find further opportunities to simplify. For example:
import csv

def read_file(path):
    with open(path) as f:
        # The files are tab-separated, so pass the delimiter to csv.reader.
        return set(row[1] for row in csv.reader(f, delimiter='\t'))
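The question's code splits on tabs, so csv.reader needs delimiter='\t' to see two columns per row; with the default comma delimiter each row would have a single field. The sketch below, in Python 3 syntax, uses io.StringIO as a stand-in for a real file so that it runs on its own. The helper name second_column_set and the choice to skip short rows are assumptions for illustration.

```python
import csv
import io

def second_column_set(lines):
    """Return the set of second-column values from tab-separated lines."""
    reader = csv.reader(lines, delimiter="\t")
    # Skip blank or one-column rows instead of raising IndexError.
    return set(row[1] for row in reader if len(row) > 1)

# In-memory stand-in for a file; the blank line is skipped by the filter above.
sample = io.StringIO("13\tabc\n123\tdef\n\n12\tmno\n")
values = second_column_set(sample)
print(sorted(values))
```

A side benefit of csv.reader is that it strips the line terminator, so the last column does not carry a trailing newline.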
As John Clements pointed out, you don't even need all three of them to be sets, only A1, so you could do this instead:
def read_file(path):
    with open(path) as f:
        for row in csv.reader(f, delimiter='\t'):
            yield row[1]

setA1 = set(read_file('A.txt'))
iterB1 = read_file('B.txt')
iterC1 = read_file('C.txt')
The only other change you need is that you have to call intersection instead of using the & operator, so:
for key in setA1.intersection(iterB1):
    print key
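The reason for the switch is that the & operator requires both operands to be sets, while the intersection method accepts any iterable. A minimal sketch in Python 3 syntax, using in-memory lines instead of files; the helper second_columns and the names linesA and linesB are hypothetical:

```python
def second_columns(lines):
    """Lazily yield the second tab-separated column of each line."""
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        yield columns[1]

linesA = ["13\tabc\n", "123\tdef\n", "12\tmno\n"]
linesB = ["12\tabc\n", "34\tqwe\n", "45\tmno\n"]

setA1 = set(second_columns(linesA))
iterB1 = second_columns(linesB)  # a generator, not a set

# set.intersection accepts any iterable; `setA1 & iterB1` would raise TypeError.
common = setA1.intersection(iterB1)
print(sorted(common))
```

Keep in mind that a generator can be consumed only once, so iterB1 is exhausted after this call and would have to be recreated before being compared again.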
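I'm not sure this last change is actually an improvement. But in Python 3.3, where the only thing you need to do is change return set(…) to yield from (…), I would probably do it this way. (Even if the files were huge and had lots of duplicates, so that doing it this way had a performance cost, I would just wrap the read_file call in unique_everseen from the itertools recipes.)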
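For reference, here is a sketch of how those two pieces could fit together in Python 3.3 or later. The unique_everseen below is a simplified version of the recipe in the itertools documentation (it omits the optional key argument), and read_columns is a hypothetical variant of read_file that takes an open file object rather than a path so that the example is self-contained.

```python
import csv
import io

def unique_everseen(iterable):
    """Yield each element the first time it appears, preserving order."""
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

def read_columns(f):
    """Lazily yield the second column of a tab-separated file object."""
    yield from (row[1] for row in csv.reader(f, delimiter="\t"))

# In-memory stand-in for a file with duplicated second-column values.
sample = io.StringIO("12\tabc\n12\tdef\n34\tabc\n45\tdef\n45\tmno\n")
unique_values = list(unique_everseen(read_columns(sample)))
print(unique_values)
```

Filtering duplicates this way keeps the reader lazy while still ensuring that each value is passed on only once.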