Comparing data from two files

Time: 2015-11-29 20:03:56

Tags: python compare

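I'm just starting out, so apologies for any confusion.

I have 2 files. File A contains a list of the sample names I'm interested in. File B contains the data for all samples.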

File A (no headers)

sample_A
sample_XA
sample_12754
samples_75t

File B

name                  description      etc .....
sample_JA                mm           0.01         0.1     1.2      0.018  etc
sample_A                 mm           0.001        1.2     0.8      1.4    etc
sample_XA                hu           0.4          0.021   0.14     2.34   etc
samples_YYYY             RN           0.0001       3.435   1.1      0.01   etc
sample_12754             mm           0.1          0.1     0.87     0.54   etc
sample_2248333           hu           0.43         0.01    0.11     2.32   etc
samples_75t              mm           0.3          0.02    0.14     2.34   etc

I want to compare file A with file B and output the data from B, but only for the sample names listed in A.

I tried this:

#!/usr/bin/env python2

import csv

count = 0

import collections
samples = collections.defaultdict(list)
with open('FILEA.txt') as d:
sites = [l.strip() for l in f if l.strip()]      

###This gives me the correct list of samples for file A.

with open('FILEB','r') as inF:
   for line in inF:
       elements = line.split()
       if sites.intersection(elements):
          count += 1

          print (elements)

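## Here I get the names of all the samples in file B, and only the names. I want the data from file B, but only for the samples in A.

Then I tried again using intersection.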

#!/usr/bin/env python2

 import sys
 import csv
 import collections

 samples = collections.defaultdict(list)
 with open('FILEA.txt','r') as f:
   nsamples = [l.strip() for l in f if l.strip()] 

 print (nsamples)

 with open ('FILEB','r') as inF:
   for row in inF:
     elements = row.split()
     if nsamples.intersection(elements):
        print(row[0,:])

It still doesn't work.

What do I have to do to get the output data as follows:
name                  description      etc .....
sample_A                 mm           0.001        1.2     0.8       1.4   etc
sample_XA                hu           0.4          0.021   0.14      2.34  etc
sample_12754             mm           0.1          0.1     0.87      0.54  etc
samples_75t              mm           0.3          0.02    0.14      2.34  etc

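Any ideas would be greatly appreciated. Thanks.

1 answer:

Answer 0 (score: 3)

Create a set from the lines of filea, then split each line of fileb once and check whether its first element is in that set: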

with open("filea") as f, open("fileb") as f2:
    # make a set of lines, stripping newlines
    # so we can compare properly later, i.e. foo\n != foo
    st  = set(map(str.rstrip, f)) # use itertools.imap on Python 2
    for line in f2:
        # split once and extract first element to compare
        if line.strip() and line.split(None, 1)[0] in st:
            print(line.rstrip())

Output:

sample_A                 mm           0.001        1.2     0.8      1.4    etc
sample_XA                hu           0.4          0.021   0.14     2.34   etc
sample_12754             mm           0.1          0.1     0.87     0.54   etc
samples_75t              mm           0.3          0.02    0.14     2.34   etc
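
Note that this output omits the header row that the question's desired output includes, because "name" is not one of the entries in file A. The attempts in the question fail for a separate reason: nsamples is a list, and lists have no intersection method, so the lookup needs a set. Below is a minimal sketch of the same set-lookup idea that also keeps the header, written for Python 3. The function name filter_samples and its keep_header parameter are hypothetical, the sketch assumes the first non-blank line of file B is the header, and the in-memory file objects stand in for the real files.

```python
import io


def filter_samples(wanted_lines, data_lines, keep_header=True):
    """Yield the lines of data_lines whose first field appears in wanted_lines."""
    # Build the lookup set once; strip whitespace and skip blank lines.
    wanted = {name.strip() for name in wanted_lines if name.strip()}
    # Assumption: the first non-blank line of the data is the header row.
    header_pending = keep_header
    for line in data_lines:
        if not line.strip():
            continue  # ignore blank lines
        if header_pending:
            header_pending = False
            yield line.rstrip()
            continue
        # Split once and compare only the first field against the set.
        if line.split(None, 1)[0] in wanted:
            yield line.rstrip()


# In-memory stand-ins for file A and file B (illustrative data only).
file_a = io.StringIO("sample_A\nsample_XA\n")
file_b = io.StringIO(
    "name description value\n"
    "sample_JA mm 0.01\n"
    "sample_A mm 0.001\n"
    "sample_XA hu 0.4\n"
)

result = list(filter_samples(file_a, file_b))
for row in result:
    print(row)
```

With real files, the two StringIO objects could be replaced by the handles from open("filea") and open("fileb"), as in the answer above.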