Question

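OK, I accept that the title is vague for my question; I couldn't find a clearer way to phrase it. I'm new to programming and my technical vocabulary is still developing.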

I have two files. File A looks like this:

CHROM   POS ID  AGM12   AGM14   AGM15   AGM18 ..
1   14930   rs150145850     0/0 1/1 0/0  0/0 ..
1   14933   rs138566748 0/0 0/0 0/0  0/0 ..
1   63671   rs116440577 0/1 0/0 0/0  0/0 ..
2   808922  rs6594027   0/0 0/0 0/0  0/1 ..
2   753474  rs2073814   1/0 0/0 0/1  0/0 ..
3   753405  rs61770173  0/0 1/1 0/0  1/0 ..
...
...
...

File B looks like this:

CHROM   POS rsID    Sample_ID
1   14930   rs150145850 AGM15
2   808922  rs6594027   AGM18
3   753405  rs61770173  AGM12
...
...
...

Using the POS field (column 2) of file B, I want to find the matching row in file A and replace the value in the column named by Sample_ID with NA.

For example, the output should look like this:

CHROM   POS ID  AGM12   AGM14   AGM15   AGM18
1   14930   rs150145850     0/0 1/1 NA   0/0
1   14933   rs138566748 0/0 0/0 0/0  0/0
1   63671   rs116440577 0/1 0/0 0/0  0/0
2   808922  rs6594027   0/0 0/0 0/0  NA
2   753474  rs2073814   1/0 0/0 0/1  0/0
3   753405  rs61770173  NA  1/1 0/0  1/0

How can I do this in Python or Unix?

Answer 1

Here is a version using the csv module (I'm assuming your columns are tab-separated).

import csv
import collections

a = 'path/to/a'
b = 'path/to/b'
output = 'output/path'

pos = collections.defaultdict(list)

with open(b) as csvin:
    reader = csv.DictReader(csvin, delimiter='\t')
    for line in reader:
        pos[line['POS']].append(line['Sample_ID'])

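# Python 3: the csv module wants text files opened with newline='' (Python 2 used 'wb')
with open(a, newline='') as csvin, open(output, 'w', newline='') as csvout: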
    reader = csv.DictReader(csvin, delimiter='\t')
    writer = csv.DictWriter(csvout, fieldnames=reader.fieldnames, delimiter='\t')
    writer.writeheader()
    for line in reader:
        fields = pos.get(line['POS'], [])
        for field in fields:
            line[field] = 'NA'
        writer.writerow(line)
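
One thing to watch with the approach above: it matches on POS alone, so two rows that share a position on different chromosomes would both be changed. A minimal sketch of the same idea keyed on (CHROM, POS) instead, assuming file B's CHROM column uses the same labels as file A's. The function name `mask_by_chrom_pos` is made up for illustration, and it takes open file objects so it can be tried on in-memory data taken from the question:

```python
import csv
import io
from collections import defaultdict


def mask_by_chrom_pos(a_file, b_file, out_file):
    """Copy a_file to out_file, writing NA into the cells that b_file names."""
    # (CHROM, POS) -> list of Sample_ID columns to overwrite on that row
    targets = defaultdict(list)
    for row in csv.DictReader(b_file, delimiter='\t'):
        targets[(row['CHROM'], row['POS'])].append(row['Sample_ID'])

    reader = csv.DictReader(a_file, delimiter='\t')
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames,
                            delimiter='\t', lineterminator='\n')
    writer.writeheader()
    for row in reader:
        for sample in targets.get((row['CHROM'], row['POS']), []):
            if sample in row:  # assumption: ignore samples file A does not have
                row[sample] = 'NA'
        writer.writerow(row)


# Two rows from the question, written with explicit tabs
a = io.StringIO(
    "CHROM\tPOS\tID\tAGM12\tAGM14\tAGM15\tAGM18\n"
    "1\t14930\trs150145850\t0/0\t1/1\t0/0\t0/0\n"
    "2\t808922\trs6594027\t0/0\t0/0\t0/0\t0/1\n"
)
b = io.StringIO(
    "CHROM\tPOS\trsID\tSample_ID\n"
    "1\t14930\trs150145850\tAGM15\n"
    "2\t808922\trs6594027\tAGM18\n"
)
out = io.StringIO()
mask_by_chrom_pos(a, b, out)
print(out.getvalue())
```

With real files you would pass objects from `open(path, newline='')` and `open(path, 'w', newline='')` instead of the `io.StringIO` buffers.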

Answer 2

Give this a try.

def method(file1, file2, fileout):
    d1, d2, headers = {}, {}, {}
    with open(file1) as f1:
        for line in f1:
            fields = line.rstrip('\n').split('\t')  # I am assuming tab-separated
            # key on POS; keep CHROM plus every column after POS
            d1[fields[1]] = [fields[0]] + fields[2:]
    with open(file2) as f2:
        for line in f2:
            fields = line.rstrip('\n').split('\t')
            d2[fields[1]] = fields[3]  # POS -> Sample_ID (one sample per POS)
    # map each column name to its index in the lists stored in d1
    for i, header in enumerate(d1['POS']):
        headers[header] = i
    with open(fileout, 'w') as fo:
        fo.write("%s\tPOS\t%s\n" % (d1['POS'][0], "\t".join(d1['POS'][1:])))
        del d1['POS']
        # dicts keep insertion order in Python 3.7+, so rows come out in file order
        for key, values in d1.items():
            if key in d2:
                values[headers[d2[key]]] = "NA"
            fo.write("%s\t%s\t%s\n" % (values[0], key, "\t".join(values[1:])))

Answer 3

If you don't mind installing an extra package, pandas can match up the two files for you:

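import pandas

# DataFrame.from_csv was removed in pandas 1.0; read_csv is its replacement
A = pandas.read_csv("A.txt", sep="\t", index_col=[0, 1])
B = pandas.read_csv("B.txt", sep="\t", index_col=[0, 1])

A.join(B) # the joined dataset: B's rsID and Sample_ID are appended to matching rows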

Of course, you have to be willing to use pandas for this.
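
Note that a join on its own only tells you which sample is flagged on each row; it does not overwrite anything. Here is a rough sketch of how the replacement itself might be done in pandas, assuming tab-separated input with the column names shown in the question. The helper name `mask_samples` is hypothetical, and it accepts either paths or file-like objects because `read_csv` takes both:

```python
import io

import pandas as pd


def mask_samples(a_source, b_source):
    """Load file A and set the cells listed in file B to the string "NA"."""
    # dtype=str keeps positions and genotypes (0/0, 0/1, ...) as plain text;
    # keep_default_na=False stops pandas from reading existing "NA" text as NaN.
    a = pd.read_csv(a_source, sep="\t", dtype=str, keep_default_na=False)
    b = pd.read_csv(b_source, sep="\t", dtype=str, keep_default_na=False)
    for chrom, pos, sample in zip(b["CHROM"], b["POS"], b["Sample_ID"]):
        if sample not in a.columns:
            continue  # assumption: skip samples that file A does not have
        rows = (a["CHROM"] == chrom) & (a["POS"] == pos)
        a.loc[rows, sample] = "NA"
    return a


# Two rows from the question, written with explicit tabs
a_text = (
    "CHROM\tPOS\tID\tAGM12\tAGM14\tAGM15\tAGM18\n"
    "1\t14930\trs150145850\t0/0\t1/1\t0/0\t0/0\n"
    "2\t808922\trs6594027\t0/0\t0/0\t0/0\t0/1\n"
)
b_text = (
    "CHROM\tPOS\trsID\tSample_ID\n"
    "1\t14930\trs150145850\tAGM15\n"
    "2\t808922\trs6594027\tAGM18\n"
)
result = mask_samples(io.StringIO(a_text), io.StringIO(b_text))
print(result.to_csv(sep="\t", index=False))
```

To write the result to disk you could call `result.to_csv("output.txt", sep="\t", index=False)`, where the output filename is just a placeholder.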

Find text from a list and replace it in the matching fields
