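I'm just starting out, so apologies for any confusion.
I have 2 files. File A contains a list of the sample names I'm interested in. File B contains the data for all samples.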
File A (no headers)
sample_A
sample_XA
sample_12754
samples_75t
File B
name description etc .....
sample_JA mm 0.01 0.1 1.2 0.018 etc
sample_A mm 0.001 1.2 0.8 1.4 etc
sample_XA hu 0.4 0.021 0.14 2.34 etc
samples_YYYY RN 0.0001 3.435 1.1 0.01 etc
sample_12754 mm 0.1 0.1 0.87 0.54 etc
sample_2248333 hu 0.43 0.01 0.11 2.32 etc
samples_75t mm 0.3 0.02 0.14 2.34 etc
I want to compare file A against file B and output the data from B, but only for the sample names listed in A.
I tried this:
#!/usr/bin/env python2
import csv
count = 0
import collections
samples = collections.defaultdict(list)
with open('FILEA.txt') as f:
    sites = [l.strip() for l in f if l.strip()]
### This gives me the correct list of samples for file A.
with open('FILEB', 'r') as inF:
    for line in inF:
        elements = line.split()
        if sites.intersection(elements):
            count += 1
            print (elements)
## Here I get the names of all the samples in file B, and only the names. I want the data from file B, but only for the samples in A.
Then I tried again using intersection:
#!/usr/bin/env python2
import sys
import csv
import collections
samples = collections.defaultdict(list)
with open('FILEA.txt', 'r') as f:
    nsamples = [l.strip() for l in f if l.strip()]
print (nsamples)
with open('FILEB', 'r') as inF:
    for row in inF:
        elements = row.split()
        if nsamples.intersection(elements):
            print(row[0,:])
That still doesn't work.
What do I have to do to get the output data as follows:
name description etc .....
sample_A mm 0.001 1.2 0.8 1.4 etc
sample_XA hu 0.4 0.021 0.14 2.34 etc
sample_12754 mm 0.1 0.1 0.87 0.54 etc
samples_75t mm 0.3 0.02 0.14 2.34 etc
Any ideas would be greatly appreciated. Thanks.
Answer 0 (score: 3)
Create a set of the lines from filea, then split each line from fileb once and check whether its first element is in the set built from filea:
with open("filea") as f, open("fileb") as f2:
    # make a set of lines, stripping newlines
    # so we can compare properly later, i.e. foo\n != foo
    st = set(map(str.rstrip, f))  # on Python 2, itertools.imap avoids building an intermediate list
    for line in f2:
        # skip blank lines, then split once and compare only the first element
        if line.strip() and line.split(None, 1)[0] in st:
            print(line.rstrip())
Output:
sample_A mm 0.001 1.2 0.8 1.4 etc
sample_XA hu 0.4 0.021 0.14 2.34 etc
sample_12754 mm 0.1 0.1 0.87 0.54 etc
samples_75t mm 0.3 0.02 0.14 2.34 etc
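Note that the attempts in the question call .intersection on a list, which only sets provide, and that the output above does not include the header row the question asked for. As a rough sketch of the same set-membership idea that also keeps the header, something like the following might work. The function name filter_rows and the keep_header flag are made up for illustration, and it assumes the header is the very first line of file B:

```python
def filter_rows(wanted_names, data_lines, keep_header=True):
    """Return the lines of data_lines whose first field is in wanted_names.

    Hypothetical helper: assumes whitespace-separated fields and, when
    keep_header is True, that the first line of data_lines is the header.
    """
    # Build a set for fast membership tests, ignoring blank entries.
    wanted = {name.strip() for name in wanted_names if name.strip()}
    kept = []
    for index, line in enumerate(data_lines):
        fields = line.split(None, 1)  # split once; only the first field matters
        if not fields:
            continue  # skip blank lines
        if (keep_header and index == 0) or fields[0] in wanted:
            kept.append(line.rstrip())
    return kept


if __name__ == "__main__":
    # Inline stand-ins for the two files from the question.
    file_a = ["sample_A\n", "sample_XA\n", "sample_12754\n", "samples_75t\n"]
    file_b = [
        "name description etc\n",
        "sample_JA mm 0.01 0.1 1.2 0.018\n",
        "sample_A mm 0.001 1.2 0.8 1.4\n",
        "sample_XA hu 0.4 0.021 0.14 2.34\n",
        "samples_75t mm 0.3 0.02 0.14 2.34\n",
    ]
    for row in filter_rows(file_a, file_b):
        print(row)
```

Because the function accepts any iterable of lines, open file objects can be passed in directly, e.g. filter_rows(open("filea"), open("fileb")).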