grep: exclude a pattern and also exclude the 2 preceding lines

Date: 2018-08-08 10:07:15

Tags: bash grep

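I have a file and want to use grep to exclude a pattern. But I also want to remove the 2 lines before each match, so that they are not included in the output either. How can I do this?

What I have tried: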

cat file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca

# With grep -v I can remove the line with pattern

grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence

# But using -B 2 does not work here

grep -B 2 -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence

Any ideas on how to also remove the 2 preceding lines for each match?

3 answers:

Answer 0 (score: 2)

Tested on GNU sed; syntax and features may vary for other implementations.

sed -E 'N;N; /[acgt]{3}cc[acgt][acgt]{3}/d' ip.txt
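  • -E enables ERE; some sed versions need -r instead of -E
  • N;N appends the next two lines to the pattern space
  • /[acgt]{3}cc[acgt][acgt]{3}/d deletes the pattern space if it matches
    • Note that this tries to match the regex anywhere in the three lines. Also, [acgt][acgt]{3} can be simplified to [acgt]{4}
    • /\n.*\n.*[acgt]{3}cc[acgt][acgt]{3}/d restricts the match to the third line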
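To make the effect concrete, here is a minimal sketch that runs the third-line-only variant against the sample from the question. The temporary file and the variable names `sample` and `kept` are assumptions made for illustration, and the sketch assumes every record is exactly three lines long.

```shell
# Recreate the sample from the question in a temporary file (path chosen by mktemp).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca
EOF

# N;N joins each group of three lines in the pattern space;
# the group is deleted only when the regex matches after the second newline.
kept=$(sed -E 'N;N; /\n.*\n.*[acgt]{3}cc[acgt]{4}/d' "$sample")
printf '%s\n' "$kept"

rm -f "$sample"
```

Only the first record should survive, because its third line does not contain the pattern. If records can have a varying number of lines, the three-line grouping no longer lines up and this approach would need rethinking.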

Answer 1 (score: 2)

What you need is:

tac file | awk '/regexp/{c=3} !(c&&c--)' | tac

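Reversing the file with tac turns "the lines before a match" into "the lines after a match", which awk can simply count off; the second tac restores the original order. Obviously, set regexp to whatever regular expression you want to match and change 3 to the number of lines you want to skip, including the matching line. For example, to skip every line containing 7 and the four lines before it: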

$ seq 20 | tac | awk '/7/{c=5} !(c&&c--)' | tac
1
2
8
9
10
11
12
18
19
20

See https://stackoverflow.com/a/17914105/1745001 for how to print arbitrary lines around a matching line.

With your example:
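If the idiom is needed more than once, it could be wrapped in a small shell function. The sketch below is one possible way to do that; the function name `drop_with_before` and its two arguments are hypothetical, and it assumes tac (GNU coreutils) is available.

```shell
# drop_with_before REGEX N (hypothetical helper):
# reads stdin and drops every line matching REGEX plus the N lines before it.
# Note: awk -v interprets backslash escapes, so a backslash in REGEX must be doubled.
drop_with_before() {
    tac | awk -v re="$1" -v n="$2" '$0 ~ re { c = n + 1 } !(c && c--)' | tac
}

# Same case as the seq example: drop lines containing 7 and the four lines before each.
result=$(seq 20 | drop_with_before 7 4)
printf '%s\n' "$result"
```

Here the counter is set to N + 1 so that the caller passes only the number of preceding lines, while the matching line itself is still counted.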

$ tac file | awk '/[acgt]{3}cc[acgt][acgt]{3}/{c=3} !(c&&c--)' | tac
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc

Now, here is something you may want to consider doing with your data:

$ cat tst.awk
++lineNr == 1 {
    delete fldNr2tag
    delete tagNr2tag
    delete tag2val
    numTags = 0

    for (i=1; i<=NF; i+=2) {
        sub(/:.*/,"",$i)
        tag = $i (i>1 ? "" : 1) # to distinguish the 2 "Sequence" tags
        val = $(i+1)
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }
}
lineNr == 2 {
    for (i=1; i<=NF; i++) {
        tag = $i
        fldNr2tag[i] = tag
    }
}
lineNr == 3 {
    for (i=1; i<=NF; i++) {
        tag = fldNr2tag[i]
        val = $i
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }

    prt()

    lineNr = 0
}

function prt(   tagNr, tag, val) {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tagNr2tag[tagNr]
        val = tag2val[tag]
        printf "tag2val[%s] = <%s>\n", tag, val
    }
    print "----"
}

$ awk -f tst.awk file
tag2val[Sequence1] = <MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___>
tag2val[from] = <1>
tag2val[to] = <296>
tag2val[Start] = <217>
tag2val[End] = <225>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aacacctcc>
----
tag2val[Sequence1] = <M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___>
tag2val[from] = <1>
tag2val[to] = <301>
tag2val[Start] = <176>
tag2val[End] = <184>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aatcctaca>
----

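Note that with the above approach you can access every value by its name, so you can avoid imprecise and/or false matches in comparisons or other calculations, and you can select whichever fields you want printed, in whatever order, just by using their field names, e.g. print tag2val["Sequence"], tag2val["Pattern"]. You can therefore easily convert your data to CSV for import into Excel, convert it to HTML or JSON, or do almost anything else with it.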
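As a rough illustration of the CSV idea, the sketch below is a much-reduced variant of that approach, not the tst.awk script itself. It assumes every record is exactly three lines and that no value contains a comma; the names `sample`, `csv`, `id` and `col` are made up for this example.

```shell
# Recreate the sample from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca
EOF

csv=$(awk '
    NR % 3 == 1 { id = $2 }                                 # "Sequence:" line: field 2 is the ID
    NR % 3 == 2 { for (i = 1; i <= NF; i++) col[$i] = i }   # header line: column name -> position
    NR % 3 == 0 {                                           # data line: pick fields by name
        print id "," $(col["Pattern"]) "," $(col["Sequence"])
    }
' "$sample")
printf '%s\n' "$csv"

rm -f "$sample"
```

Each output row holds the sequence ID, the pattern and the matched sequence. Real CSV output would also need quoting for values that contain commas or quotes, which this sketch skips.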

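Answer 2 (score: 1)

Looking at the sample file, it appears to have a record-oriented structure, so I would be very wary of trying to manipulate it with line-oriented tools such as grep or sed. As pointed out in the comments, there is already a similar problem with a solution in sed, but the script is not pretty and would be a nightmare to maintain or extend.

I would be tempted to write a short Perl or Python script that parses the file into records and then works with the records. I do not know the details of the file format, but something like the following may be a good start and produces the desired output.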

#!/usr/bin/perl -w

use strict;

my $line = <>;
unless (defined($line) && $line =~ /^Sequence/) {
    die "expected line to start with Sequence";
}
while (defined($line)) {
    my $record = $line;
    $line = <>;
    while (defined($line) && $line !~ /^Sequence/) {
        $record .= $line;
        $line = <>;
    }
    print $record unless $record =~ /[acgt]{3}cc[acgt][acgt]{3}/;
}
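For comparison, the same record-oriented idea can be sketched in awk instead of Perl. This is an alternative illustration, not a translation of the script above: unlike the Perl version it does not abort when the first line is not a Sequence line, and the names `sample`, `filtered` and `rec` are assumptions. The regex is written without {n} intervals so that it also runs on awk builds that lack them.

```shell
# Recreate the sample from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca
EOF

# A record starts at each line beginning with "Sequence" and runs to the next one.
# A finished record is printed only if the regex matches nowhere inside it.
filtered=$(awk -v re='[acgt][acgt][acgt]cc[acgt][acgt][acgt][acgt]' '
    /^Sequence/ { if (rec != "" && rec !~ re) printf "%s", rec; rec = "" }
                { rec = rec $0 "\n" }
    END         { if (rec != "" && rec !~ re) printf "%s", rec }
' "$sample")
printf '%s\n' "$filtered"

rm -f "$sample"
```

Because records are delimited by the Sequence lines rather than by a fixed line count, this keeps working if a record ever contains more than one data line.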