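I have a file and want to use grep to exclude the lines that match a pattern. But I also want to remove the 2 lines before each match, so that they are not included in the output either. How can I do that?
What I have tried: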
cat file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___ from: 1 to: 296
Start End Strand Pattern Mismatch Sequence
217 225 + pattern:AA[CT]NNN[AT]CN . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___ from: 1 to: 301
Start End Strand Pattern Mismatch Sequence
176 184 + pattern:AA[CT]NNN[AT]CN . aatcctaca
# With grep -v I can remove the line with pattern
grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___ from: 1 to: 296
Start End Strand Pattern Mismatch Sequence
217 225 + pattern:AA[CT]NNN[AT]CN . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___ from: 1 to: 301
Start End Strand Pattern Mismatch Sequence
# But using -B 2 does not work here
grep -B 2 -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___ from: 1 to: 296
Start End Strand Pattern Mismatch Sequence
217 225 + pattern:AA[CT]NNN[AT]CN . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___ from: 1 to: 301
Start End Strand Pattern Mismatch Sequence
Any ideas on how to also remove the 2 preceding lines for each match?
Answer 0 (score: 2)
Tested on GNU sed; syntax and features may differ in other implementations.
sed -E 'N;N; /[acgt]{3}cc[acgt][acgt]{3}/d' ip.txt
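-E enables ERE; some sed versions need -r instead of -E
N;N appends two more lines to the pattern space
/[acgt]{3}cc[acgt][acgt]{3}/d deletes the pattern space if it matches
[acgt][acgt]{3} can be simplified to [acgt]{4}
/\n.*\n.*[acgt]{3}cc[acgt][acgt]{3}/d would restrict the match to the third line only
Answer 1 (score: 2)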
What you need is:
tac file | awk '/regexp/{c=3} !(c&&c--)' | tac
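Set regexp to whatever regular expression you want to match, and change 3 to the number of lines to skip, including the matching line. For example, to skip every line containing 7 and the four lines before it: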
$ seq 20 | tac | awk '/7/{c=5} !(c&&c--)' | tac
1
2
8
9
10
11
12
18
19
20
For information on how to print arbitrary lines around a matching line, see https://stackoverflow.com/a/17914105/1745001.
Using your example:
$ tac file | awk '/[acgt]{3}cc[acgt][acgt]{3}/{c=3} !(c&&c--)' | tac
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___ from: 1 to: 296
Start End Strand Pattern Mismatch Sequence
217 225 + pattern:AA[CT]NNN[AT]CN . aacacctcc
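The reverse-and-count idea behind the tac | awk | tac pipeline can also be sketched in Python. This is only an illustration, not part of the original answer: the function name drop_match_and_preceding and its before parameter are made up here, and the input is assumed to fit in memory as a list of lines.

```python
import re


def drop_match_and_preceding(lines, pattern, before=2):
    """Return lines without any line matching `pattern` or the `before` lines above it."""
    regex = re.compile(pattern)
    kept = []
    skip = 0  # lines still to drop while walking the input backwards
    for line in reversed(lines):
        if regex.search(line):
            # The matching line itself plus `before` lines; this mirrors c=3 in the awk script.
            skip = before + 1
        if skip:
            skip -= 1
            continue
        kept.append(line)
    kept.reverse()  # restore the original order, like the second tac
    return kept


# Same input as the seq 20 example: drop every line containing 7 and the 4 lines before it.
numbers = [str(n) for n in range(1, 21)]
print(drop_match_and_preceding(numbers, "7", before=4))
```

Walking the input backwards turns "lines before a match" into "lines after a match", so a simple countdown is enough and no look-ahead buffer is needed.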
Now, here is something you may want to consider for your data:
$ cat tst.awk
++lineNr == 1 {
    delete fldNr2tag
    delete tagNr2tag
    delete tag2val
    numTags = 0
    for (i=1; i<=NF; i+=2) {
        sub(/:.*/,"",$i)
        tag = $i (i>1 ? "" : 1) # to distinguish the 2 "Sequence" tags
        val = $(i+1)
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }
}
lineNr == 2 {
    for (i=1; i<=NF; i++) {
        tag = $i
        fldNr2tag[i] = tag
    }
}
lineNr == 3 {
    for (i=1; i<=NF; i++) {
        tag = fldNr2tag[i]
        val = $i
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }
    prt()
    lineNr = 0
}
function prt( tagNr, tag, val) {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tagNr2tag[tagNr]
        val = tag2val[tag]
        printf "tag2val[%s] = <%s>\n", tag, val
    }
    print "----"
}
$ awk -f tst.awk file
tag2val[Sequence1] = <MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___>
tag2val[from] = <1>
tag2val[to] = <296>
tag2val[Start] = <217>
tag2val[End] = <225>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aacacctcc>
----
tag2val[Sequence1] = <M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___>
tag2val[from] = <1>
tag2val[to] = <301>
tag2val[Start] = <176>
tag2val[End] = <184>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aatcctaca>
----
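Note that with the above you can access every value by its name, so you can exclude imprecise and/or false matches from comparisons or other calculations, and you can print whichever fields you want in whatever order you like just by using their names, e.g. print tag2val["Sequence"], tag2val["Pattern"]. So you could easily convert your data to CSV for import into Excel, or to HTML or JSON, or do almost anything else with it.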
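As a rough illustration of the CSV conversion mentioned above, here is a minimal Python sketch. It assumes every record is exactly three lines laid out as in the sample (a tag/value line, a column-name line, a value line); parse_records and the choice of output columns are hypothetical and not part of the original answer.

```python
import csv
import io


def parse_records(lines):
    """Group lines in threes and map each field name to its value."""
    records = []
    for i in range(0, len(lines) - 2, 3):
        info, header, values = (line.split() for line in lines[i:i + 3])
        record = {}
        # First line alternates "tag:" and value, e.g. "Sequence: <id> from: 1 to: 296".
        for pos in range(0, len(info) - 1, 2):
            tag = info[pos].rstrip(":")
            # Rename the first tag to "Sequence1" so it does not clash with the "Sequence" column.
            record[tag + "1" if pos == 0 else tag] = info[pos + 1]
        # Second line names the columns of the third line.
        record.update(zip(header, values))
        records.append(record)
    return records


sample = [
    "Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___ from: 1 to: 301",
    "Start End Strand Pattern Mismatch Sequence",
    "176 184 + pattern:AA[CT]NNN[AT]CN . aatcctaca",
]
records = parse_records(sample)

out = io.StringIO()
# Pick any columns in any order; fields not listed are ignored.
writer = csv.DictWriter(out, fieldnames=["Sequence1", "Start", "End", "Sequence"],
                        extrasaction="ignore")
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```

Once each record is a dictionary keyed by field name, filtering on a single field (for instance only the "Sequence" column) avoids accidental matches in the header line.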
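Answer 2 (score: 1)
Looking at the sample file, it appears to have a record-oriented structure, so I would be very wary of trying to manipulate it with line-oriented tools such as grep and sed. As pointed out in the comments, there is already a similar question with a solution in sed, but the script is not pretty and would be a nightmare to maintain or extend.
I would be tempted to write a short Perl or Python script that parses the file into records and then works with the records. I don't know the details of the file format, but something like the following might be a good start, and it produces the desired output.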
#!/usr/bin/perl -w
use strict;
my $line = <>;
unless (defined($line) && $line =~ /^Sequence/) {
    die "expected line to start with Sequence";
}
while (defined($line)) {
    my $record = $line;
    $line = <>;
    while (defined($line) && $line !~ /^Sequence/) {
        $record .= $line;
        $line = <>;
    }
    print $record unless $record =~ /[acgt]{3}cc[acgt][acgt]{3}/;
}
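For comparison, the same record-based approach might look like this in Python. It is a sketch under the same assumption as the Perl script, namely that a record starts at each line beginning with "Sequence"; the names read_records and filter_records are hypothetical. Unlike the Perl version, it does not abort when the first line is not a "Sequence" line.

```python
import re

# Same motif as in the question, with [acgt][acgt]{3} shortened to [acgt]{4}.
MOTIF = re.compile(r"[acgt]{3}cc[acgt]{4}")


def read_records(lines):
    """Yield lists of lines, starting a new record at each line that begins with 'Sequence'."""
    record = []
    for line in lines:
        if line.startswith("Sequence") and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def filter_records(lines):
    """Return the lines of every record in which no line contains the motif."""
    kept = []
    for record in read_records(lines):
        if not any(MOTIF.search(line) for line in record):
            kept.extend(record)
    return kept


# The two records from the question's file.txt.
sample = [
    "Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___ from: 1 to: 296\n",
    "Start End Strand Pattern Mismatch Sequence\n",
    "217 225 + pattern:AA[CT]NNN[AT]CN . aacacctcc\n",
    "Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___ from: 1 to: 301\n",
    "Start End Strand Pattern Mismatch Sequence\n",
    "176 184 + pattern:AA[CT]NNN[AT]CN . aatcctaca\n",
]
print("".join(filter_records(sample)), end="")
```

To run it on a real file, the sample list could be replaced by the lines of an opened file, for example filter_records(open("file.txt")).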