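I am trying to filter the rows of one file, keeping only those whose value in a particular column matches a value in another file. I would appreciate your help.
My data looks like this:
File1: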
Group Position Code Answer c1 c2 c3 c4
1 3 s1_60 A etc etc etc etc
2 4 s2_63 T etc2_ etc2 etc2/ etc2'
3 5 s1_23 A etc3 etc3 etc3* etc3
3 51 s7_52 T etc4 etc4_ etc4 etc4^
File2:
>1
ATGCGCGCGCGCGATATATTGCTGATATATATGCCTTttaagatcaatat
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGCGAGAGAGAGAGAtgtgttgtagataGACGAG
>2
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCaaaaaaGAGAGAGAGAGAtgtgttgtagataGACG
>3
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG
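'Group' refers to the number after '>' in File2, and 'Position' refers to the position of the letter (counting from 1) within that group's sequence. I want to keep only the rows whose 'Answer' column matches the letter at that position in File2.
So the output would look like this:
newOutput: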
Group Position Code Answer c1 c2 c3 c4
2 4 s2_63 T etc2_ etc2 etc2/ etc2'
3 5 s1_23 A etc3 etc3 etc3* etc3
3 51 s7_52 T etc4 etc4_ etc4 etc4^
The first row of File1 is not included because its Answer is 'A', whereas position 3 of group 1 in File2 is 'G'.
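Any help would be appreciated. I am thinking of starting with awk or Python. I have never worked with data spread across multiple files, so this is a bit frustrating for me. Please let me know.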
Answer 0 (score: 1)
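import csv

# Read File2 into a dict: group number -> uppercased sequence
with open("File2") as infile:
    d = {}
    bases = ''
    group = None
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous group
            if group is not None:
                d[group] = bases.upper()
            group = int(line[1:])
            bases = ''
            continue
        bases += line
    # Store the last group, which has no header after it
    if group is not None:
        d[group] = bases.upper()

# Keep only the File1 rows whose Answer matches the base at (Group, Position).
# This assumes File1 is tab-separated.
with open("File1", newline='') as infile, open('output', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader))  # copy the header row
    for g, pos, code, answer, *rest in reader:
        g = int(g)
        pos = int(pos)
        # Position is 1-based, Python indexing is 0-based
        if d[g][pos-1] == answer.upper():
            writer.writerow([g, pos, code, answer] + rest)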
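If File1 turns out to be separated by spaces rather than tabs (the sample as pasted looks that way), splitting on any whitespace may be easier than using csv. Below is a minimal sketch of that variant. The function names `read_groups` and `filter_rows` are hypothetical, the sequences are shortened versions of the sample data, and it works on lists of lines so it can be tried without any files:

```python
def read_groups(lines):
    """Map each '>N' header to its concatenated, uppercased sequence."""
    groups = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            groups[current] = []
        elif current is not None:
            groups[current].append(line.upper())
    return {name: "".join(parts) for name, parts in groups.items()}


def filter_rows(rows, groups):
    """Yield the header, then rows whose Answer equals the base at Position."""
    rows = iter(rows)
    header = next(rows, None)
    if header is not None:
        yield header.rstrip("\n")
    for row in rows:
        fields = row.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        group, pos, answer = fields[0], int(fields[1]), fields[3]
        seq = groups.get(group, "")
        # Position is 1-based; out-of-range positions never match
        if 1 <= pos <= len(seq) and seq[pos - 1] == answer.upper():
            yield row.rstrip("\n")


# Shortened sample data (assumption: only the first few bases are needed here)
file2 = [">1", "ATGCGC", ">2", "tattag"]
file1 = [
    "Group Position Code Answer c1",
    "1 3 s1_60 A etc",
    "2 4 s2_63 T etc2_",
]
kept = list(filter_rows(file1, read_groups(file2)))
print("\n".join(kept))
```

With real files, `read_groups(open("File2"))` and `filter_rows(open("File1"), groups)` should work the same way, since both functions only iterate over lines. Unknown groups and out-of-range positions are silently dropped here, which may or may not be what you want.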
Answer 1 (score: 1)
Here is an awk solution:
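BEGIN {
    # Indices into the two parts of a "group_position" key
    GROUP=1;
    BASE=2;
}
# First file (File1): remember the expected answer for each group/position
NR == FNR {
    if (FNR > 1)    # skip the header row
        positions[$1"_"$2]=toupper($4)    # Answer is column 4 in the sample File1
}
# Second file (File2): concatenate the sequence lines of each group
NR != FNR {
    if($0 ~ /^>/) {
        group=substr($0, 2);
    } else {
        gsub(" ", "", $0);
        seqs[group]=seqs[group]$0;
    }
}
END {
    print "Group","Position","Answer"
    for(current_group in seqs) {
        for(key in positions) {
            split(key,position,"_");
            if(position[GROUP] == current_group) {
                # Look up the base in the group being checked, not the last group read
                if(toupper(substr(seqs[current_group],position[BASE],1)) \
                   == positions[key]) {
                    print position[GROUP],
                          position[BASE],
                          positions[key];
                }
            }
        }
    }
}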
awk -f script.awk File1 File2
Output:
Group Position Answer
2 4 T
3 5 A
Position 51 of group 3 appears to be G, not T, so my output differs from yours.
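The discrepancy can be checked directly against the sample. Here is a small Python sketch that joins the three lines of group 3 from File2 as posted and reads the base at position 51 (the variable names are only illustrative):

```python
# The three sequence lines of group 3, copied from the sample File2
group3_lines = [
    "tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg",
    "gggggaaaataaacaggggggcaaataattctgactacaattgtatatat",
    "ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG",
]

# Join the lines and normalise case so 'g' and 'G' compare equal
seq = "".join(group3_lines).upper()

# Position is 1-based, so position 51 is index 50
base_51 = seq[51 - 1]
print(base_51)
```

Position 51 falls at the start of the second line, which begins with a run of g's, so the row with Answer T at that position does not match the sample File2 as posted.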