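How to remove lines from a file using a structure criterion

Time: 2018-08-08 13:38:58

Tags: bash awk sed

My file has a strict structure, and I want to remove the lines that do not conform to it. The structure should be: 1) a line starting with the word "Sequence", 2) a line starting with the word "Start", 3) a line starting with a number.

In my file, some records have no number line but still have the first two lines (the number lines were removed with grep). I am looking for a way, with awk or sed, to delete the two preceding lines whenever the number line is missing. Is that possible?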

cat file.txt
Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca

Expected output:

cat file.txt
Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca

4 answers:

Answer 0 (score: 2)

You can use the following awk command:

awk '/^[0-9]+/ && NR==a["Sequence:"]+2 && NR==a["Start"]+1 {
   print r["Sequence:"] ORS r["Start"] ORS $0
}
/^(Sequence:|Start)/ {
   a[$1]=NR
   r[$1]=$0
}' file

Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca

Answer 1 (score: 1)

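% awk '
  $1 == "Sequence:" {seq   = $0}   # remember the latest Sequence line
  $1 == "Start"     {start = $0}   # remember the latest Start line
  # a numeric line directly after Start, which directly follows Sequence: print all three
  $1 ~ /^[0-9]+$/ && l == "Start" && L == "Sequence:" {print seq; print start; print}
  {L = l}    # first field of the line two back
  {l = $1}   # first field of the previous line
  ' file.txt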
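
The commands so far expect exactly one numeric line per record, which is all the sample shows. If a record could carry several numeric lines (an assumption, not something stated in the question), a buffering variant along these lines might be a starting point. The function name keep_records_with_hits and the miniature input are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: keep a Sequence/Start header pair only if at least
# one numeric line follows it, and keep every numeric line of that record.
keep_records_with_hits() {
  awk '
    /^Sequence:/ { seq = $0; start = ""; printed = 0; next }  # new record: reset state
    /^Start/     { start = $0; next }                          # header line for this record
    /^[0-9]/ && seq != "" && start != "" {
      if (!printed) { print seq; print start; printed = 1 }    # emit the header pair once
      print                                                    # emit the numeric line itself
    }
  ' "$@"
}

# Made-up input: record B has no numeric line, record C has two.
printf '%s\n' \
  'Sequence: A' 'Start End' '1 2 x' \
  'Sequence: B' 'Start End' \
  'Sequence: C' 'Start End' '3 4 y' '5 6 z' |
  keep_records_with_hits
```

Record B is dropped because its header pair is overwritten before any numeric line arrives. On the real data it could be run as keep_records_with_hits file.txt.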

Answer 2 (score: 1)

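For files that fit in memory, you can slurp the entire file and process it in one go:

perl -0777 -pe 's/^Sequence.*\nStart.*\n(?!\d)//mg' ip.txt
  • -0777 slurps the entire file as a single string
  • the m flag lets the ^ and $ anchors match at line boundaries inside the multi-line string
  • the g flag removes every incomplete pair rather than only the first one
  • ^Sequence.*\nStart.*\n(?!\d) matches ^Sequence.*\nStart.*\n only when it is not followed by a digit. Note that . does not match a newline unless the s flag is used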

Alternatively, you can match and print only the complete groups:

perl -0777 -ne 'print /^Sequence.*\nStart.*\n\d.*\n/mg' ip.txt

Answer 3 (score: 1)

To print only the three-line records, all you need is:

$ cat tst.awk
/^Sequence:/ { lineNr=0; rec="" }
{ rec = (++lineNr > 1 ? rec ORS : "") $0 }
lineNr == 3 { print rec }

For example:

$ awk -f tst.awk file
Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca

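However, for a more useful way of analyzing the data, take another look at the script at the bottom of my answer to your previous question. To adapt it to discard records with fewer than 3 lines, all you need to do is move the lineNr=0 assignment out of the lineNr==3 block and into a new /^Sequence:/ block. The script then keeps doing what it did before: it gives you an array whose fields you can access by name: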

$ cat tst.awk
/^Sequence:/ { lineNr = 0 }

++lineNr == 1 {
    delete fldNr2tag
    delete tagNr2tag
    delete tag2val
    numTags = 0

    for (i=1; i<=NF; i+=2) {
        sub(/:.*/,"",$i)
        tag = $i (i>1 ? "" : 1) # to distinguish the 2 "Sequence" tags
        val = $(i+1)
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }
}
lineNr == 2 {
    for (i=1; i<=NF; i++) {
        tag = $i
        fldNr2tag[i] = tag
    }
}
lineNr == 3 {
    for (i=1; i<=NF; i++) {
        tag = fldNr2tag[i]
        val = $i
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }

    prt()
}

function prt(   tagNr, tag, val) {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tagNr2tag[tagNr]
        val = tag2val[tag]
        printf "tag2val[%s] = <%s>\n", tag, val
    }
    print "----"
}

$ awk -f tst.awk file
tag2val[Sequence1] = <HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_>
tag2val[from] = <1>
tag2val[to] = <296>
tag2val[Start] = <217>
tag2val[End] = <225>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aacacctcc>
----
tag2val[Sequence1] = <MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___>
tag2val[from] = <1>
tag2val[to] = <296>
tag2val[Start] = <217>
tag2val[End] = <225>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aacacctcc>
----
tag2val[Sequence1] = <L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___>
tag2val[from] = <1>
tag2val[to] = <301>
tag2val[Start] = <176>
tag2val[End] = <184>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aatactaca>
----
tag2val[Sequence1] = <X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__>
tag2val[from] = <1>
tag2val[to] = <290>
tag2val[Start] = <176>
tag2val[End] = <184>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aatactaca>
----

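If all you wanted was to print the input lines as-is, that would be trivial, but I really think the above is what you want going forward, so that you can add all kinds of comparisons and output combinations.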
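
As one illustration of such a comparison, the sketch below reuses the tag-to-value idea to list only the records whose Start value exceeds a threshold. It is a trimmed-down variation rather than the script above: the function name filter_by_start, the min variable, the choice of printed fields and the shortened sample records are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: list records whose Start value exceeds a threshold.
filter_by_start() {
  awk -v min="$1" '
    /^Sequence:/  { lineNr = 0 }                                   # a new record begins
    ++lineNr == 1 { name = $2 }                                    # sequence identifier
    lineNr == 2   { for (i = 1; i <= NF; i++) fldNr2tag[i] = $i }  # column names
    lineNr == 3   {
      for (i = 1; i <= NF; i++) tag2val[fldNr2tag[i]] = $i         # access fields by name
      if (tag2val["Start"] + 0 > min + 0)                          # numeric comparison
        print name, tag2val["Start"], tag2val["End"], tag2val["Sequence"]
    }
  '
}

# Shortened, made-up records in the same layout as file.txt;
# record B has no numeric line and is therefore never reported.
printf '%s\n' \
  'Sequence: A     from: 1   to: 296' \
  'Start     End  Strand Pattern                 Mismatch Sequence' \
  '217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc' \
  'Sequence: B     from: 1   to: 301' \
  'Start     End  Strand Pattern                 Mismatch Sequence' \
  'Sequence: C     from: 1   to: 301' \
  'Start     End  Strand Pattern                 Mismatch Sequence' \
  '176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca' |
  filter_by_start 200
# prints: A 217 225 aacacctcc
```

The helper reads standard input, so on the real data it could be invoked as filter_by_start 200 < file.txt.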