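I am trying to use AWK, but the first record does not seem to be parsed correctly. I hope someone can help me fix it.
I have a file in which each record is 3 lines long, but some records have 4 lines (so there is both a $3 and a $4). My goal is to print all three lines of each record and, if there is a fourth line, to also print the first two lines followed by the fourth line (without the third).
My strategy is to use the string "Sequence: " as RS and a newline ("\n") as FS.
My file looks like this: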
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
With the following code I get a garbled first record, because the separator string also appears at the very beginning of the file.
awk '{ RS="Sequence: "; FS="\n" }
{
if ($4 != "" )
print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
print $1,"\n",$2,"\n",$3 ;
}' short.txt > test
Output:
Sequence:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
Sequence:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
1
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
So I thought I should remove the first "Sequence: " string from the input file, but that gives:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
1
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
to:
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
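So the first record is garbled again. Is there a way to solve this? My expected output is the last output (with or without the string "Sequence: "), but with the first record correct.
Answer 0 (score: 2)
It sounds like this is what you are trying to do: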
$ cat tst.awk
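/^Sequence/ { if (NR>1) prt() }     # a new record starts: flush the previous one
{ rec[++cnt] = $0 }                 # buffer every line of the current record
END { prt() }                       # flush the last record
function prt() {
    print rec[1] ORS rec[2] ORS rec[3]        # always print lines 1-3
    if (cnt == 4) {
        print rec[1] ORS rec[2] ORS rec[4]    # 4-line record: repeat lines 1-2 with line 4
    }
    cnt=0                                     # reset the buffer for the next record
}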
$ awk -f tst.awk file
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
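Trying to use RS for this only makes your life harder, and the resulting code is not portable: POSIX leaves a multi-character RS unspecified, so it works in gawk but not in every awk.
Answer 1 (score: 1)
Your code can easily be fixed like this: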
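If a record may contain more than two match lines, the same line-by-line idea can be written without buffering. The following is only a sketch under assumptions that go beyond the question: the toy input is hypothetical, every record is assumed to start with a "Sequence: " line followed by a line beginning with "Start ", and the variable names hdr and cols are made up for illustration.

```shell
# Hypothetical toy input: record A has one match line, record B has two.
input='Sequence: A
Start End
1 2
Sequence: B
Start End
3 4
5 6'

out=$(printf '%s\n' "$input" | awk '
/^Sequence: / { hdr = $0; next }             # remember the record header line
/^Start /     { cols = $0; next }            # remember the column-title line
NF            { print hdr ORS cols ORS $0 }  # one 3-line block per non-empty match line
')
printf '%s\n' "$out"
```

It uses only the default RS and FS, so it should behave the same in any POSIX awk, and record B is printed twice, once for each of its match lines.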
BEGIN{ RS="Sequence: "; FS="\n" }
(NR==1){next}
{
if ($4 != "" )
print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
print $1,"\n",$2,"\n",$3 ;
}
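The first record is empty, which is why it is skipped with next.
The reason your first record was garbled is that you defined RS and FS after the first record had already been read (i.e. not in a BEGIN block, which runs before any input is processed).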
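The timing issue can be seen on a tiny input. This is a minimal sketch using a made-up two-line input and ";" as a stand-in separator, not the poster's file: when RS is assigned in the main rule, the first record has already been read with the default newline RS.

```shell
# RS assigned in the main rule: record 1 was already read using the default RS (newline).
first_main=$(printf 'x;y\nz\n' | awk '{ RS=";" } NR==1 { print; exit }')

# RS assigned in BEGIN: it is in effect before the first record is read.
first_begin=$(printf 'x;y\nz\n' | awk 'BEGIN { RS=";" } NR==1 { print; exit }')

echo "RS set in main rule: $first_main"    # whole first line, "x;y"
echo "RS set in BEGIN:     $first_begin"   # only "x"
```

The same thing happens in the question: the first line is read as a newline-terminated record and split on whitespace, which is why "Sequence:", the identifier and "from:" end up on separate lines.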
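However, what you probably really want is RS="(^|\n)Sequence: ", which ensures that the separator only matches at the start of a line or at the start of the file.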
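To illustrate what the anchoring guards against, here is a small sketch with hypothetical data in which "Sequence: " also occurs in the middle of a line. It assumes an awk that accepts a regular expression as RS (gawk does; this is not guaranteed by POSIX), and it skips the empty leading record with NF rather than NR==1.

```shell
# Hypothetical data: the first header line contains "Sequence: " a second time, mid-line.
data='Sequence: A (alias Sequence: A2)
row1
Sequence: B
row2
'

# Unanchored separator: the mid-line occurrence also splits the record.
unanchored=$(printf '%s' "$data" | awk 'BEGIN { RS="Sequence: " } NF { n++ } END { print n+0 }')

# Anchored separator: only matches at the start of the file or right after a newline.
anchored=$(printf '%s' "$data" | awk 'BEGIN { RS="(^|\n)Sequence: " } NF { n++ } END { print n+0 }')

echo "unanchored: $unanchored non-empty records"
echo "anchored:   $anchored non-empty records"
```

With the unanchored form the first record is cut in two, so three non-empty records are counted instead of two.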