我有一个包含数万行的blastn输出文件。我只对部分查询序列ID与主题序列ID的一部分不匹配的行感兴趣,我希望将其放入新的文本文件中。以下是我想要从中提取信息的大量输出文件的摘录,例如:
qseqid qlen qstart qend sseqid slen sstart send evalue bitscore length pident nident mismatch gaps
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 121 679 OFAS003927-RA-EXON03_Anisoscelini_Anisoscelis_flavolineatus_CMF_0018_S7_L005_UQ_trinity_assembled 557 1 557 0 832 562 93.594 526 28 8
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 155 650 OFAS003927-RA-EXON03_Placoscelini_Plaxiscelis_limbata_CMF_0072_S29_L005_UQ_trinity_assembled 820 327 819 0 808 496 96.169 477 16 3
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 222 686 OFAS003927-RA-EXON03_Anisoscelini_Leptoscelis_tricolor_CMF_0079_S32_L005_UQ_trinity_assembled 465 1 465 0 793 465 97.419 453 12 0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 429 635 OFAS003927-RA-EXON03B_Clavigrallini_Clavigralla_sp_CMF_0335_S81_L005_UQ_trinity_assembled 655 1 207 4.30E-87 316 207 94.203 195 12 0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 531 629 OFAS003927-RA-EXON07_Mictini_Anoplocnemis_sp_CMF_0052_S20_L005_UQ_trinity_assembled 668 1 99 9.92E-39 156 99 94.949 94 5 0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 0 1286 696 100 696 0 0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_declivis_CMF_0069_S26_L005_UQ_trinity_assembled 1060 332 1025 0 1212 696 98.132 683 11 2
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_thomasi_CMF_0028_S13_L005_UQ_trinity_assembled 814 50 745 0 1147 698 96.418 673 21 4
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 695 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_confraterna_CMF_0123_S44_L005_UQ_trinity_assembled 1313 578 1274 0 1131 699 95.994 671 22 6
qseqid =查询序列ID
sseqid =主题序列ID
每个行的两个ID之间的OFAS#-RA-EXON#应该匹配什么。如果不是这种情况,例如第4行和第5行,我想提取整行并放入新的文本文件中。我知道需要使用一些正则表达式模式,但是对于我来说,如何指示列和按行搜索并不清楚。
答案 0 :(得分:0)
这适用于GNU Awk:
tail -n+2 input.txt | awk '{ if( substr($1,0,21) != substr($5,0,21)) { print $0 } }'
问候!