awk - Find a pattern in a line and remove it together with the upstream part

Time: 2014-11-05 10:15:09

Tags: bash search awk

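I initially filtered my text file so that it contains only the lines in which a pattern was identified (in this case "TCTGTACTATATTG"). Now, in the resulting file, I want to remove this pattern, together with the characters upstream of it, from every line that contains it. What is the best way to do this with AWK?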

Here is my input:

@DGTKZQN1:384:C364AACXX:1:1109:19757:66886 2:N:0:GTGAAA
AACAGTTTCTGTACTATATTGACTCATAAGAGTGGTTTAATACGAAGGGAGGAGAAGTTTCCTGGAAATAATCGATTTCCTAGCTTTTAGTTGCAATAAT
+
CCCFFFFFHHHHDIIJJJJJJJJJIIJEIJHHCFGFFGHIIIIJGGIJGG@GHIGEEFDGGIGIJJIEHGIEHHHEDFFFDEEEDDEDDCCDBDDDCDDD
@DGTKZQN1:384:C364AACXX:1:1109:20360:66756 2:N:0:GTGAAA
TTTCTGTACTATATTGGGTGTGAGAAGTAATGGTGCACTCCACAGACCTCCAGTGGCTGCTTGTTCGCCAGAACAGCAAATTTCTGCAGAAGCGCAAAAG
+
@@CFFFFFHHHGHIIIJI;GCGGIIIJFHIIJGEDGGIJIICBDFIIIIJHIIGHIDHGEEHGHHIIJHGD?DDFEECEDDDDCDCCDDDCDDDDDDBC>
@DGTKZQN1:384:C364AACXX:1:1109:21207:66784 2:N:0:GTGAAA
AACAGTTTCTGTACTATATTGTACGTTGTGGATTATTAAAGGGAATAAAAGTGGTAGATTGTGCAGTTGAGGCAGGCTCTCAACTGTGAAACAGCGGTGG
+
@@CFFBDDFHBDCGG<?:CEEAFEEF@A3<?<3C>FEGHGG@DB?8BF@G>?0909??DF>HE@C=)8CEH9DHCB:AED>?C@6>C;6>C3?3=@B8B=
@DGTKZQN1:384:C364AACXX:1:1109:21026:66836 2:N:0:GTGAAA
AGAACAGTTTCTGTACTATATTGTTATACTTCTGTTGTGGGTGTAGAGTTTTCTCCGGCGTTGGCTTCAATGGAATAAGGCACGAGATGAATCCGTGGAG
+
@@@FFFFDHHHDHHIIJJEHHJGJJIGIIEIIIIEHEGHIJDF?DGEE4??DG@FGEG:FHHHHF@D@CEACEEEDDDCCCDDBDDDDDDDACDB??>BD

The output should look like this:

@DGTKZQN1:384:C364AACXX:1:1109:19757:66886 2:N:0:GTGAAA
ACTCATAAGAGTGGTTTAATACGAAGGGAGGAGAAGTTTCCTGGAAATAATCGATTTCCTAGCTTTTAGTTGCAATAAT
+
CCCFFFFFHHHHDIIJJJJJJJJJIIJEIJHHCFGFFGHIIIIJGGIJGG@GHIGEEFDGGIGIJJIEHGIEHHHEDFFFDEEEDDEDDCCDBDDDCDDD
@DGTKZQN1:384:C364AACXX:1:1109:20360:66756 2:N:0:GTGAAA
GGTGTGAGAAGTAATGGTGCACTCCACAGACCTCCAGTGGCTGCTTGTTCGCCAGAACAGCAAATTTCTGCAGAAGCGCAAAAG
+
@@CFFFFFHHHGHIIIJI;GCGGIIIJFHIIJGEDGGIJIICBDFIIIIJHIIGHIDHGEEHGHHIIJHGD?DDFEECEDDDDCDCCDDDCDDDDDDBC>
@DGTKZQN1:384:C364AACXX:1:1109:21207:66784 2:N:0:GTGAAA
TACGTTGTGGATTATTAAAGGGAATAAAAGTGGTAGATTGTGCAGTTGAGGCAGGCTCTCAACTGTGAAACAGCGGTGG
+
@@CFFBDDFHBDCGG<?:CEEAFEEF@A3<?<3C>FEGHGG@DB?8BF@G>?0909??DF>HE@C=)8CEH9DHCB:AED>?C@6>C;6>C3?3=@B8B=
@DGTKZQN1:384:C364AACXX:1:1109:21026:66836 2:N:0:GTGAAA
TTATACTTCTGTTGTGGGTGTAGAGTTTTCTCCGGCGTTGGCTTCAATGGAATAAGGCACGAGATGAATCCGTGGAG
+
@@@FFFFDHHHDHHIIJJEHHJGJJIGIIEIIIIEHEGHIJDF?DGEE4??DG@FGEG:FHHHHF@D@CEACEEEDDDCCCDDBDDDDDDDACDB??>BD

I have tried using awk with the split function, but I am struggling to use a string as the field separator.

2 answers:

Answer 0 (score: 1)

It looks like a simple sed command should work for you:

sed -i.bak 's/^.*TCTGTACTATATTG//g' file

Using awk:

awk '{gsub(/^.*TCTGTACTATATTG/, "")} 1' file

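However, sed has the added benefit of editing the file in place (here, -i.bak keeps a backup of the original as file.bak).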
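As for the field-separator attempt mentioned in the question, here is a minimal sketch of how a whole string can serve as awk's field separator. The function name strip_upstream and the shortened sample record are made up for illustration. A separator longer than one character is treated as a regular expression, which is harmless here because the pattern consists only of letters.

```shell
# Hypothetical helper: split each line on the pattern and keep only the
# last field, i.e. the text downstream of the last occurrence.
strip_upstream() {
  awk -F'TCTGTACTATATTG' '
    NF > 1 { print $NF; next }   # pattern present: print what follows it
    1                            # pattern absent: print the line unchanged
  '
}

# Made-up, shortened record in the same layout as the input above
printf '%s\n' '@read1' 'AACAGTTTCTGTACTATATTGACTCAT' '+' 'CCCFFFFF' | strip_upstream
```

Printing $NF rather than $2 keeps the result in line with the greedy `^.*` used in the sed and awk commands above, which also remove everything up to the last occurrence of the pattern.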

Answer 1 (score: 0)

sed -i.bak 's/.*TCTGTACTATATTG//g' file
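One thing to keep in mind with this command: `.*` is greedy, so if the pattern occurs more than once on a line, everything up to the last occurrence is removed. If only the text up to the first occurrence should go, something along these lines might work. This is a sketch under that assumption; strip_to_first and the sample line are invented for illustration.

```shell
# Hypothetical helper: remove everything up to and including the FIRST
# occurrence of the pattern; lines without the pattern pass through unchanged.
strip_to_first() {
  awk -v pat='TCTGTACTATATTG' '
    {
      i = index($0, pat)                           # 1-based position, 0 if absent
      if (i > 0) $0 = substr($0, i + length(pat))  # keep only the text after it
      print
    }'
}

# Made-up line that contains the pattern twice
line='AATCTGTACTATATTGCCTCTGTACTATATTGGG'

printf '%s\n' "$line" | sed 's/.*TCTGTACTATATTG//'   # greedy: prints GG
printf '%s\n' "$line" | strip_to_first               # first match: prints CCTCTGTACTATATTGGG
```

Because index() does a plain string search rather than a regex match, this variant also behaves the same if the pattern ever contains regex metacharacters.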