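Filtering out rows under a specific condition

Time: 2014-11-03 04:56:36

Tags: python awk

I am trying to filter rows based on whether a specific value matches a value in another file. I would appreciate your help.

My data looks like this:

File1: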

  Group   Position Code     Answer  c1     c2    c3    c4   
  1       3        s1_60    A       etc    etc   etc   etc
  2       4        s2_63    T       etc2_  etc2  etc2/ etc2'
  3       5        s1_23    A       etc3   etc3  etc3* etc3
  3       51       s7_52    T       etc4   etc4_ etc4  etc4^

File2:

>1
ATGCGCGCGCGCGATATATTGCTGATATATATGCCTTttaagatcaatat
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGCGAGAGAGAGAGAtgtgttgtagataGACGAG
>2
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCaaaaaaGAGAGAGAGAGAtgtgttgtagataGACG
>3
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG

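'Group' refers to the number after '>' in File2, and 'Position' refers to the position of the letter within the specified group. I want to keep only the rows whose 'Answer' column matches the letter found at that position in File2.

So the output would look like this: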

newOutput:

  Group   Position Code     Answer  c1     c2    c3    c4
  2       4        s2_63    T       etc2_  etc2  etc2/ etc2'
  3       5        s1_23    A       etc3   etc3  etc3* etc3
  3       51       s7_52    T       etc4   etc4_ etc4  etc4^

The first row of File1 is not included because its Answer is 'A', while position 3 of group 1 in File2 is 'G'.
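
The rule can be checked by hand on the first two data rows. This is only a minimal sketch: it assumes positions are 1-based and that the comparison ignores case (both are inferred from the expected output above, not stated explicitly), and `base_at` is a hypothetical helper name.

```python
def base_at(seq, pos):
    """Return the upper-cased letter at 1-based position `pos` of `seq`."""
    return seq[pos - 1].upper()


# First sequence line of group 1 and of group 2, copied from File2.
group1 = "ATGCGCGCGCGCGATATATTGCTGATATATATGCCTTttaagatcaatat"
group2 = "tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg"

print(base_at(group1, 3))  # G: differs from Answer 'A', so row 1 is dropped
print(base_at(group2, 4))  # T: equals Answer 'T', so row 2 is kept
```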

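I would appreciate any help. I was thinking of starting with awk or Python. I have never organized data that spans multiple files, so this is a bit frustrating for me. Please let me know.

2 answers:

Answer 0 (score: 1)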

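import csv

# Build {group number: upper-cased sequence} from the FASTA-style File2.
with open("File2") as infile:
    d = {}
    bases = ''
    group = None
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            # New header: store the sequence collected for the previous group.
            if group is not None:
                d[group] = bases.upper()
            group = int(line[1:])
            bases = ''
            continue
        bases += line
    # Store the last group, which has no header following it.
    if group is not None:
        d[group] = bases.upper()

# Copy the header, then keep only the rows whose Answer matches File2.
# This assumes File1 is tab-delimited.
with open("File1") as infile, open('output', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader))
    for g, pos, code, answer, *rest in reader:
        g = int(g)
        pos = int(pos)
        # Position is 1-based, so subtract 1 to index the string.
        if d[g][pos-1] == answer:
            writer.writerow([g, pos, code, answer] + rest)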
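
The sample File1 in the question looks space-aligned rather than tab-delimited, in which case `delimiter='\t'` would not split the columns. The following is a rough sketch of an alternative that splits on any whitespace and writes matching lines back unchanged. `filter_rows` is a hypothetical helper, and the choice to silently drop out-of-range positions and unknown groups is an assumption, not something the question specifies.

```python
def filter_rows(lines, seqs):
    """Yield the header plus every row whose Answer matches File2.

    lines: iterable of File1 lines (header first).
    seqs:  dict mapping group number (int) to its full sequence.
    """
    lines = iter(lines)
    yield next(lines).rstrip("\n")      # pass the header through unchanged
    for line in lines:
        fields = line.split()           # split on any run of spaces or tabs
        if len(fields) < 4:
            continue                    # skip blank or malformed lines
        group, pos, answer = int(fields[0]), int(fields[1]), fields[3]
        seq = seqs.get(group, "")
        # Positions are assumed to be 1-based; out-of-range rows are dropped.
        if 1 <= pos <= len(seq) and seq[pos - 1].upper() == answer.upper():
            yield line.rstrip("\n")


# Tiny in-memory demo using shortened data from the question.
seqs = {1: "ATGCGC", 2: "tattag"}
rows = [
    "Group Position Code  Answer c1",
    "1     3        s1_60 A      etc",
    "2     4        s2_63 T      etc2_",
]
kept = list(filter_rows(rows, seqs))
print("\n".join(kept))
```

With real files, `seqs` could be the dictionary `d` built from File2, and `lines` an open handle on File1.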

Answer 1 (score: 1)

Here is an awk solution:

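BEGIN {
    # Indexes into the "group_position" key after split().
    GROUP=1;
    BASE=2;
}
# First file (File1): remember the expected answer per group/position.
NR == FNR {
    if (FNR > 1)    # skip the header row; Answer is the 4th column
        positions[$1"_"$2]=toupper($4)
}

# Second file (File2): concatenate the sequence lines of each group.
NR != FNR {
    if($0 ~ /^>/) {
        group=substr($0, 2, length($0));
    } else {
        gsub(" ", "", $0);
        seqs[group]=seqs[group]$0;
    }
}

END {
    print "Group","Position","Answer"
    for(current_group in seqs) {
        for(key in positions) {
            split(key,position,"_");
            if(position[GROUP] == current_group) {
                # Compare the base at this 1-based position with the answer.
                if(toupper(substr(seqs[current_group],position[BASE],1)) \
                        == positions[key]) {
                    print position[GROUP],
                          position[BASE],
                          positions[key];
                }
            }
        }
    }
}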

awk -f script.awk File1 File2

Output:

Group Position Answer
2 4 T
3 5 A

Position 51 of group 3 appears to be G, not T, so my output differs from yours.
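
The position 51 lookup can be double-checked with a short Python sketch. It assumes every full sequence line in File2 is 50 letters wide, as in the sample, and that positions are 1-based. `locate` is a hypothetical helper introduced only for this illustration.

```python
# Group 3 exactly as it is wrapped in File2 (50 letters per line).
group3_lines = [
    "tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg",
    "gggggaaaataaacaggggggcaaataattctgactacaattgtatatat",
    "ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG",
]


def locate(pos, width=50):
    """Map a 1-based position to a 1-based (line number, column) pair."""
    line, col = divmod(pos - 1, width)
    return line + 1, col + 1


line_no, col = locate(51)
base = group3_lines[line_no - 1][col - 1].upper()
joined = "".join(group3_lines).upper()   # same lookup on the joined sequence

print(line_no, col, base)   # 2 1 G
print(joined[51 - 1])       # G
```

Position 51 is the first letter of the second line, which is why the row with Answer T is not matched.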