I have 2 tab-delimited files, for example:
file1:
12 23 43 34
433 435 76 76
file2:
123 324 53 65
12 457 54 32
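I want to loop through both files, comparing every line of file1 with every line of file2 and vice versa. If, for example, the first number of line 1 in file1 is the same as the first number of line 2 in file2, I want to write that line from file1 to a file called output. Then I want to put all the lines from file1 that didn't find a match into a new file, and likewise all the lines from file2 that have no match in file1.
So far I have been able to find the matching lines and put them in a file, but I am having trouble putting the unmatched lines into 2 separate files.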
one=open(file1, 'r').readlines()
two=open(file2, 'r').readlines()
output=open('output.txt', 'w')
count=0
list1=[] #list for lines in file1 that didn't find a match
list2=[] #list for lines in file2 that didn't find a match
for i in one:
    for j in two:
        columns1=i.strip().split('\t')
        num1=int(columns1[0])
        columns2=j.strip().split('\t')
        num2=int(columns2[0])
        if num1==num2:
            count+=1
            output.write(i+j)
        else:
            list1.append(i)
            list2.append(j)
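The problem I am having here is with the else part. I would really appreciate it if someone could show me the right, better way to do this.
EDIT: Thanks for the quick replies, everyone. The 3 outputs I want are:
Output_file1: #Matching results between the 2 files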
12 23 43 34 #line from file1
12 457 54 32 #line from file2
Output_file2: #lines from the first file that didn't find a match
433 435 76 76
Output_file3: #lines from the second file that didn't find a match
123 324 53 65
Answer 0 (score: 2)
I would suggest that you use the csv module to read your files (you might have to mess around with the dialect; see http://docs.python.org/library/csv.html for help):
import csv
# 'excel-tab' is the csv module's built-in dialect for tab-delimited files
one = csv.reader(open(file1, 'r'), dialect='excel-tab')
two = csv.reader(open(file2, 'r'), dialect='excel-tab')
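Then you might find it easier to "zip" along the lines of both files at the same time (see http://docs.python.org/library/itertools.html#itertools.izip_longest; izip_longest is named zip_longest in Python 3):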
import itertools
file_match = open('match', 'w')
file_nomatch1 = open('nomatch1', 'w')
file_nomatch2 = open('nomatch2', 'w')
# zip_longest is the Python 3 name; in Python 2 use itertools.izip_longest
for i, j in itertools.zip_longest(one, two, fillvalue="-"):
    if i[0] == j[0]:
        file_match.write(str(i)+'\n')
    else:
        file_nomatch1.write(str(i)+'\n')
        file_nomatch2.write(str(j)+'\n')
        # and maybe handle the case where one is "-"
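I re-read the post and realized you are looking for a match between any two lines in the two files. Maybe someone will find the code above useful, but it won't solve your particular problem.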
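For matching across any two lines rather than line by line, one possible approach (a rough sketch, not the code from this answer) is to index the second file's lines by their first field and then make a single pass over the first file. The names `first_field` and `match_lines` are made up for illustration, and the in-memory lists stand in for the contents of file1 and file2.

```python
from collections import defaultdict


def first_field(line):
    """Return the first tab-separated field of a line (hypothetical helper)."""
    return line.split('\t', 1)[0].strip()


def match_lines(lines1, lines2):
    """Split lines into matched pairs, file1-only lines and file2-only lines."""
    # Index the second file by first field so each lookup is a dict access.
    index2 = defaultdict(list)
    for line in lines2:
        index2[first_field(line)].append(line)

    matched, only1 = [], []
    for line in lines1:
        key = first_field(line)
        if key in index2:
            # file1 line first, then every file2 line sharing its key
            matched.append(line)
            matched.extend(index2[key])
        else:
            only1.append(line)

    keys1 = {first_field(line) for line in lines1}
    only2 = [line for line in lines2 if first_field(line) not in keys1]
    return matched, only1, only2


# Sample data from the question, assumed to be tab-separated.
file1_lines = ["12\t23\t43\t34\n", "433\t435\t76\t76\n"]
file2_lines = ["123\t324\t53\t65\n", "12\t457\t54\t32\n"]
matched, only1, only2 = match_lines(file1_lines, file2_lines)
```

Each of the three lists can then be written to its own file with `writelines`. Unlike the nested loop in the question, every unmatched line ends up in its list exactly once.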
Answer 1 (score: 2)
I would suggest using set operations:
from collections import defaultdict

def parse(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for line in f:
            # take the first number and use it as the key for this line
            # (split() with no argument splits on tabs as well as spaces)
            num = int(line.split()[0])
            result[num].append(line)
    return result

def select(selected, items):
    result = []
    for s in selected:
        result.extend(items[s])
    return result

one = parse('one.txt')
two = parse('two.txt')
one_s = set(one)
two_s = set(two)
intersection = one_s & two_s
one_only = one_s - two_s
two_only = two_s - one_s

# for every key, collect the lines from file one followed by those from file two
one_two = defaultdict(list)
for e in one: one_two[e].extend(one[e])
for e in two: one_two[e].extend(two[e])

open('intersection.txt', 'w').writelines(select(intersection, one_two))
open('one_only.txt', 'w').writelines(select(one_only, one))
open('two_only.txt', 'w').writelines(select(two_only, two))
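To make the set operations concrete, here is a small worked example on just the first-column numbers from the sample files in the question. The variable names are illustrative only and the keys are typed in by hand rather than read from disk.

```python
# First-column values of the sample file1 and file2 from the question.
keys_one = {12, 433}
keys_two = {123, 12}

both = keys_one & keys_two        # keys present in both files
one_only = keys_one - keys_two    # keys that appear only in file1
two_only = keys_two - keys_one    # keys that appear only in file2
either = keys_one ^ keys_two      # symmetric difference: keys in exactly one file

print(both, one_only, two_only)   # {12} {433} {123}
```

Note that `^` does not say which file a key came from, so the two `-` differences are what separate the unmatched lines into their own outputs.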
Answer 2 (score: 1)
I think this code serves your purpose:
one=open(file1, 'r').readlines()
two=open(file2, 'r').readlines()
output=open('output.txt', 'w')
first = {x.split('\t')[0] for x in one}
second = {x.split('\t')[0] for x in two}
common = first.intersection( second )
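# list comprehensions instead of filter(), which returns an iterator in Python 3
list1 = [x for x in one if x.split('\t')[0] not in common]  # file1 lines without a match
list2 = [x for x in two if x.split('\t')[0] not in common]  # file2 lines without a match
res1 = [x for x in one if x.split('\t')[0] in common]
res2 = [x for x in two if x.split('\t')[0] in common]
count = len( res1 )
# lines are paired by position, so this assumes res1 and res2 line up one to one
for a, b in zip(res1, res2):
    output.write( a )
    output.write( b )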
Answer 3 (score: 1)
I don't think this is the best way, but it works for me and looks easy to understand:
# Sorry but was not able to check code below
def get_diff(fileObj1, fileObj2):
    f1Diff = []
    f2Diff = []
    outputData = []
    # x is one row
    f1Data = set(x.strip() for x in fileObj1)
    f2Data = set(x.strip() for x in fileObj2)
    f1Column1 = set(x.split('\t')[0] for x in f1Data)
    f2Column1 = set(x.split('\t')[0] for x in f2Data)
    # keys found in only one file: use set difference, since ^ would mix both files
    l1Col1Diff = f1Column1 - f2Column1
    l2Col1Diff = f2Column1 - f1Column1
    commonPart = f1Column1 & f2Column1
    for line in f1Data.union(f2Data):
        lineKey = line.split('\t')[0]
        if lineKey in commonPart:
            outputData.append(line)
        elif lineKey in l1Col1Diff:
            f1Diff.append(line)
        elif lineKey in l2Col1Diff:
            f2Diff.append(line)
    return outputData, f1Diff, f2Diff

outputData, file1Missed, file2Missed = get_diff(open(file1, 'r'), open(file2, 'r'))
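Since get_diff strips each line, the three returned lists hold rows without trailing newlines, so they need the newline added back when saved. A minimal sketch of that last step follows; `write_lines`, the output file names and the temporary directory are assumptions for illustration, and the hard-coded lists stand in for the values get_diff would return on the question's sample data.

```python
import os
import tempfile


def write_lines(path, lines):
    """Write each string in `lines` to `path`, one row per line (hypothetical helper)."""
    with open(path, 'w') as handle:
        handle.writelines(line + '\n' for line in lines)


# Stand-ins for outputData, file1Missed and file2Missed.
results = {
    'output.txt': ['12\t23\t43\t34', '12\t457\t54\t32'],
    'file1_only.txt': ['433\t435\t76\t76'],
    'file2_only.txt': ['123\t324\t53\t65'],
}

out_dir = tempfile.mkdtemp()  # scratch directory so the sketch leaves no files behind in cwd
for name, lines in results.items():
    write_lines(os.path.join(out_dir, name), lines)
```

Because get_diff builds its lists from sets, the order of rows in each output file is not guaranteed to follow the input order.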