Copy lines to a new file if parts of two columns do not match

Time: 2018-04-11 15:39:29

Tags: bash

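I have a blastn output file with tens of thousands of lines. I am only interested in the lines where part of the query sequence ID does not match part of the subject sequence ID, and I want to put those lines in a new text file. Here is an excerpt of the large output file I want to extract information from: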

qseqid qlen qstart qend sseqid slen sstart send evalue bitscore length pident nident mismatch gaps
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   744 121 679 OFAS003927-RA-EXON03_Anisoscelini_Anisoscelis_flavolineatus_CMF_0018_S7_L005_UQ_trinity_assembled   557 1   557 0   832 562 93.594  526 28  8
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   744 155 650 OFAS003927-RA-EXON03_Placoscelini_Plaxiscelis_limbata_CMF_0072_S29_L005_UQ_trinity_assembled    820 327 819 0   808 496 96.169  477 16  3
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   744 222 686 OFAS003927-RA-EXON03_Anisoscelini_Leptoscelis_tricolor_CMF_0079_S32_L005_UQ_trinity_assembled   465 1   465 0   793 465 97.419  453 12  0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   744 429 635 OFAS003927-RA-EXON03B_Clavigrallini_Clavigralla_sp_CMF_0335_S81_L005_UQ_trinity_assembled   655 1   207 4.30E-87    316 207 94.203  195 12  0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   744 531 629 OFAS003927-RA-EXON07_Mictini_Anoplocnemis_sp_CMF_0052_S20_L005_UQ_trinity_assembled 668 1   99  9.92E-39    156 99  94.949  94  5   0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   696 1   696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   696 1   696 0   1286    696 100 696 0   0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   696 1   696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_declivis_CMF_0069_S26_L005_UQ_trinity_assembled    1060    332 1025    0   1212    696 98.132  683 11  2
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   696 1   696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_thomasi_CMF_0028_S13_L005_UQ_trinity_assembled 814 50  745 0   1147    698 96.418  673 21  4
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled   696 1   695 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_confraterna_CMF_0123_S44_L005_UQ_trinity_assembled 1313    578 1274    0   1131    699 95.994  671 22  6

qseqid = query sequence ID

sseqid = subject sequence ID

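What should match between the two IDs on each line is the OFAS#-RA-EXON# part. Where it does not, as in rows 4 and 5 of the excerpt (not counting the header), I want to extract the whole line and put it in a new text file. I know I will need some regex pattern, but it is not clear to me how to specify the columns and search line by line.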

1 Answer:

Answer 0: (score: 0)

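This works with GNU Awk (and should work with any POSIX awk). It compares the part of each ID before the first underscore, which is the OFAS#-RA-EXON# prefix, so suffixed exons such as EXON03B are also caught; a fixed-width comparison of the first 20 characters would treat EXON03 and EXON03B as equal. NR > 1 skips the header line, and the mismatching lines are written to mismatches.txt:

awk 'NR > 1 { split($1, q, "_"); split($5, s, "_"); if (q[1] != s[1]) print }' input.txt > mismatches.txt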
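
As a slightly longer sketch, the same comparison can be wrapped in a shell function that keeps the header row and compares whatever precedes the first underscore in each ID. The function name `filter_mismatched` and the file names in the usage comment are made up for illustration. The sketch also assumes that the IDs never contain whitespace and that the OFAS#-RA-EXON# prefix always ends at the first underscore, as it does in the excerpt above.

```shell
# Hypothetical helper: print the header plus every row whose qseqid (column 1)
# and sseqid (column 5) differ in the part before the first underscore.
filter_mismatched() {
    awk '
        NR == 1 { print; next }      # pass the header line through unchanged
        {
            q = $1; s = $5           # work on copies so the row itself is not modified
            sub(/_.*/, "", q)        # drop everything from the first "_" onward
            sub(/_.*/, "", s)
            if (q != s) print        # prefixes differ: keep the whole row
        }
    ' "$1"
}

# Usage (file names are placeholders):
#   filter_mismatched blast_output.txt > mismatches.txt
```

Because awk splits on runs of whitespace by default, this should behave the same whether the blastn table is separated by tabs or by spaces. If the header is not wanted in the output, the `NR == 1` rule can be changed to `NR == 1 { next }`.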

Regards!