Question

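I am trying to filter the rows of one file, keeping only those whose value matches the corresponding value in another file. I would appreciate any help.

My data looks like this:

File1: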

  Group   Position Code     Answer  c1     c2    c3    c4   
  1       3        s1_60    A       etc    etc   etc   etc
  2       4        s2_63    T       etc2_  etc2  etc2/ etc2'
  3       5        s1_23    A       etc3   etc3  etc3* etc3
  3       51       s7_52    T       etc4   etc4_ etc4  etc4^

File2:

>1
ATGCGCGCGCGCGATATATTGCTGATATATATGCCTTttaagatcaatat
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGCGAGAGAGAGAGAtgtgttgtagataGACGAG
>2
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCaaaaaaGAGAGAGAGAGAtgtgttgtagataGACG
>3
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG

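'Group' refers to the number after '>' in 'File2', and 'Position' refers to the position of the letter within the specified group. I want to keep only the rows whose 'Answer' column matches the letter at that position in 'File2'.

So the output would look like this:

newOutput: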
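The matching rule described above could be sketched in Python roughly as follows. This is only an illustration: `row_matches` is a hypothetical helper name, the comparison is assumed to be case-insensitive, and only the first ten bases of groups 1 and 2 from File2 are included.

```python
# Hypothetical helper, not part of the original post.
def row_matches(sequences, group, position, answer):
    """Return True if the base at the 1-based position equals answer."""
    seq = sequences[group]
    # Positions in File1 are 1-based, Python strings are 0-based.
    return seq[position - 1].upper() == answer.upper()

# Only the first ten bases of groups 1 and 2 from File2 are used here.
sequences = {1: "ATGCGCGCGC", 2: "tattagcccc"}

print(row_matches(sequences, 1, 3, "A"))  # group 1, position 3 is G -> False
print(row_matches(sequences, 2, 4, "T"))  # group 2, position 4 is t -> True
```

With the sample data, the first row of File1 is rejected and the second is kept.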

  Group   Position Code     Answer  c1     c2    c3    c4
  2       4        s2_63    T       etc2_  etc2  etc2/ etc2'
  3       5        s1_23    A       etc3   etc3  etc3* etc3
  3       51       s7_52    T       etc4   etc4_ etc4  etc4^

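The first row of 'File1' is not included because it has 'A' rather than 'K'.

I would appreciate any help. I was thinking of starting with awk or Python. I have never organized data that spans multiple files, so this has been a bit frustrating for me. Please let me know.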

Answer 1

import csv

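# Build a dict mapping each group number to its full, upper-cased sequence.
with open("File2") as infile:
    d = {}
    bases = ''
    group = None
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous group, if there was one.
            if group is not None:
                d[group] = bases.upper()
            group = int(line[1:])
            bases = ''
            continue
        bases += line
    # Store the last group, which has no header following it.
    d[group] = bases.upper()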

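# Assumes File1 is tab-delimited; adjust the delimiter if it is not.
with open("File1", newline='') as infile, open('output', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader))  # copy the header row unchanged
    for g, pos, code, answer, *rest in reader:
        g = int(g)
        pos = int(pos)
        # Position is 1-based, so subtract 1 for Python indexing.
        if d[g][pos-1] == answer:
            writer.writerow([g, pos, code, answer] + rest)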

Answer 2

Here is an awk solution:

BEGIN {
    GROUP=1;
    BASE=2;
}
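NR == FNR {
    # First file (File1): key is "group_position", value is the Answer
    # column, which is the 4th field in File1 as posted.
    positions[$1"_"$2]=toupper($4)
}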

NR != FNR {
    if($0 ~ /^>/) {
        group=substr($0, 2, length($0));
    } else {
        gsub(" ", "", $0);
        seqs[group]=seqs[group]$0;
    }
}

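END {
    print "Group","Position","Answer"
    # Iteration order of "for (x in array)" is unspecified in awk,
    # so the rows may not come out in input order.
    for(current_group in seqs) {
        for(key in positions) {
            split(key,position,"_");
            if(position[GROUP] == current_group) {
                # Look up the sequence of the group being checked,
                # not the last group that was read.
                if(toupper(substr(seqs[current_group],position[BASE],1)) \
                        == positions[key]) {
                    print position[GROUP],
                          position[BASE],
                          positions[key];
                }
            }
        }
    }
}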

awk -f script.awk File1 File2

Output:

Group Position Answer
2 4 T
3 5 A

Position 51 of group 3 appears to be G, not T, so my output differs from yours.
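One way to check that observation independently is a small Python sketch like the one below. It is only an illustration: `read_fasta` is a hypothetical helper name, and the text is group 3 copied from File2 as posted in the question.

```python
import io

def read_fasta(handle):
    """Map each '>' header to its concatenated, upper-cased sequence."""
    seqs = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line.upper()
    return seqs

# Group 3 exactly as posted in File2.
file2_text = """>3
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG
"""

seqs = read_fasta(io.StringIO(file2_text))
print(seqs["3"][51 - 1])  # 1-based position 51 -> G
print(seqs["3"][5 - 1])   # 1-based position 5  -> A
```

The first sequence line is 50 characters long, so position 51 is the first character of the second line.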

Filtering out rows under a specific condition
