Removing the spaces between variable strings in a file

Time: 2018-07-27 15:49:47

Tags: removing-whitespace

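I have 718 similarly formatted files that need a little cleanup before a program can use them. In each original file the first line begins with a space, which needs to be removed. The DNA sequences should also not have a space after every 10 bases (it is fine for a sequence to be broken across multiple lines). Below, I first show what an original file looks like, and then what it should look like.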

Original:

 14 128
Alydidae_Micrelytrinae_Leptocorisini_Stenocoris_tipuloides_CMF_0174_S59_L005     caggacccga ggttcaacag cgagattgac atgaggacag gttacaagac
Coreidae_Coreinae_Acanthocephalini_Acanthocephala_thomasi_CMF0028_UQ             caggacccgc gatttaacag tgagatagac atgcgaacag gctacaagac
Coreidae_Coreinae_Anisoscelini_Anisoscelis_alipes_CMF0018_UQ                     caggacccgc ggtttaacag tgagatagac atgcgaactg gctacaagac
Coreidae_Coreinae_Mictini_Anoplocnemis_sp_CMF0020_UQ                             caggacccgc gcttcaacag tgagatagac atgcgaacag gctataagac
Coreidae_Coreinae_Mictini_Mygdonia_tuberculosa_CMF0053_UQ                        caggacccgc gcttcaacag tgagatagac atgcgaacag gctataagac
Coreidae_Coreinae_Nematopodini_Mozena_nr_lineolata_CMF0026_UQ                    caggacccgc ggttcaacag cgatatagac atgcgaacag gctacaggac
Coreidae_Coreinae_Nematopodini_Thasus_neocalifornicus_CMF0190_UQ                 caggacccgc gtttcaacag cgagatcgat atgcggacag ggtacaagac
Coreidae_Pseudophloeinae_Clavigrallini_Clavigralla_sp_CMF_0335_S81_L005_UQ       caggatccga ggttcaacag cgagatagac atgaggacag gttacaaaac
Coreidae_Pseudophloeinae_Pseudophloeini_Myla_sp_CMF_0091_S35_L005_UQ             caggacccga ggttcaacag cgagatagac atgcggacag gctataaaac
Largidae_Largus_sp_CMF_0230_S65_L005_UQ                                          caggacccga ggttcaacag cgaaatagac atgaggactg gctataagac
Pentatomidae_Halyomorpha_halys_halhal1                                           caggatccga ggttcaacag cgaaatcgac atgaggactg gctacaagac
Pyrrhocoridae_Dysdercus_mimus_CMF_0110_S42_L005_UQ                               caggatcctc gtttcaacag cgaaatcgac atgagaacag gttacaagac
Pyrrhocoridae_Dysdercus_suturellus_CMF_0305_S71_L005_UQ                          caggatcctc gtttcaacag cgaaatcgac atgagaacag gttacaagac
Rhopalidae_Serinethinae_Serinethini_Jadera_haematoloma_CMF_0281_S69_L005_UQ      caggaccccc gttttaacag tgaaatagac atgcgaaccg gttacaagac

                                                                             taacactatc ctctgcggcc ccatctctaa ctacgaaggt gatgtgattg
                                                                             caacactatc ctctgtgggc ccatctctaa ctacgaagga gaggtgatag
                                                                             caacaccatc ctctgtgggc ctatttctaa ctacgaaggg gaggtgatag
                                                                             caacactata ctctgcgggc ctatatccaa ctacgaagga gaggtgattg
                                                                             caacacgata ctctgtgggc ctatatctaa ctacgaagga gaggtgatag
                                                                             gaacaccatc ctttgcgggc cgatctccaa ctacgagggg gaggtgatcg
                                                                             caacaccatc ctctgcgggc ctatctccaa ctacgaaggg gaggtgatcg
                                                                             caacaccatc ctctgtggac ccatctctaa ctacgaagga gaagtgatag
                                                                             caacaccatc ctctgcgggc ccatctccaa ctacgaaggg gaggtgatcg
                                                                             tcataccatt ctatgtgggc ctatttcaaa ttacgaaggg gaagtgatcg
                                                                             taacaccatc ctctgcggcc ccatttccaa ctacgaaggc gaagtgattg
                                                                             caacacaata ctctgcggac ccatatcgaa ctacgaaggt gaagtcatag
                                                                             caacacaata ctctgcggac ccatatcgaa ctacgaaggt gaagtcatag
                                                                             ccacaccatc ctctgcggac ccatctccaa ctacgaaggt gaggtgatag

                                                                             gagttgccca gatcatcaac aagactga
                                                                             gagtagctca gatcatcaac aagaccga
                                                                             gggtagctca gatcatcaac aagacgga
                                                                             gagtagctca gatcatcaat aagactga
                                                                             gagtagctca gatcatcaat aagaccga
                                                                             gggtggcaca gatcatcaac aagacgga
                                                                             gagtggctca gatcatcaac aagacgga
                                                                             gcgtcgcaca gatcatc--- --------
                                                                             gcgtcgcaca gatcataaac aagaccga
                                                                             gggtagccca gatcataaac aaaacaga
                                                                             gagtcgccca gatcatcaac aaaactga
                                                                             gagtggcgca gatcatcatt aaaaccga
                                                                             gagtggcgca gatcatcaat aaaacgga
                                                                             gagtagccca gatcatcaac aagacgga

What it should look like after processing:

14 128
Alydidae_Micrelytrinae_Leptocorisini_Stenocoris_tipuloides_CMF_0174_S59_L005       caggacccgaggttcaacagcgagattgacatgaggacaggttacaagac
Coreidae_Coreinae_Acanthocephalini_Acanthocephala_thomasi_CMF0028_UQ               caggacccgcgatttaacagtgagatagacatgcgaacaggctacaagac
Coreidae_Coreinae_Anisoscelini_Anisoscelis_alipes_CMF0018_UQ                       caggacccgcggtttaacagtgagatagacatgcgaactggctacaagac
Coreidae_Coreinae_Mictini_Anoplocnemis_sp_CMF0020_UQ                               caggacccgcgcttcaacagtgagatagacatgcgaacaggctataagac
Coreidae_Coreinae_Mictini_Mygdonia_tuberculosa_CMF0053_UQ                          caggacccgcgcttcaacagtgagatagacatgcgaacaggctataagac
Coreidae_Coreinae_Nematopodini_Mozena_nr_lineolata_CMF0026_UQ                      caggacccgcggttcaacagcgatatagacatgcgaacaggctacaggac
Coreidae_Coreinae_Nematopodini_Thasus_neocalifornicus_CMF0190_UQ                   caggacccgcgtttcaacagcgagatcgatatgcggacagggtacaagac
Coreidae_Pseudophloeinae_Clavigrallini_Clavigralla_sp_CMF_0335_S81_L005_UQ         caggatccgaggttcaacagcgagatagacatgaggacaggttacaaaac
Coreidae_Pseudophloeinae_Pseudophloeini_Myla_sp_CMF_0091_S35_L005_UQ               caggacccgaggttcaacagcgagatagacatgcggacaggctataaaac
Largidae_Largus_sp_CMF_0230_S65_L005_UQ                                            caggacccgaggttcaacagcgaaatagacatgaggactggctataagac
Pentatomidae_Halyomorpha_halys_halhal1                                             caggatccgaggttcaacagcgaaatcgacatgaggactggctacaagac
Pyrrhocoridae_Dysdercus_mimus_CMF_0110_S42_L005_UQ                                 caggatcctcgtttcaacagcgaaatcgacatgagaacaggttacaagac
Pyrrhocoridae_Dysdercus_suturellus_CMF_0305_S71_L005_UQ                            caggatcctcgtttcaacagcgaaatcgacatgagaacaggttacaagac
Rhopalidae_Serinethinae_Serinethini_Jadera_haematoloma_CMF_0281_S69_L005_UQ        caggacccccgttttaacagtgaaatagacatgcgaaccggttacaagac

                                                                               taacactatcctctgcggccccatctctaactacgaaggtgatgtgattg
                                                                               caacactatcctctgtgggcccatctctaactacgaaggagaggtgatag
                                                                               caacaccatcctctgtgggcctatttctaactacgaaggggaggtgatag
                                                                               caacactatactctgcgggcctatatccaactacgaaggagaggtgattg
                                                                               caacacgatactctgtgggcctatatctaactacgaaggagaggtgatag
                                                                               gaacaccatcctttgcgggccgatctccaactacgagggggaggtgatcg
                                                                               caacaccatcctctgcgggcctatctccaactacgaaggggaggtgatcg
                                                                               caacaccatcctctgtggacccatctctaactacgaaggagaagtgatag
                                                                               caacaccatcctctgcgggcccatctccaactacgaaggggaggtgatcg
                                                                               tcataccattctatgtgggcctatttcaaattacgaaggggaagtgatcg
                                                                               taacaccatcctctgcggccccatttccaactacgaaggcgaagtgattg
                                                                               caacacaatactctgcggacccatatcgaactacgaaggtgaagtcatag
                                                                               caacacaatactctgcggacccatatcgaactacgaaggtgaagtcatag
                                                                               ccacaccatcctctgcggacccatctccaactacgaaggtgaggtgatag

                                                                               gagttgcccagatcatcaacaagactga
                                                                               gagtagctcagatcatcaacaagaccga
                                                                               gggtagctcagatcatcaacaagacgga
                                                                               gagtagctcagatcatcaataagactga
                                                                               gagtagctcagatcatcaataagaccga
                                                                               gggtggcacagatcatcaacaagacgga
                                                                               gagtggctcagatcatcaacaagacgga
                                                                               gcgtcgcacagatcatc-----------
                                                                               gcgtcgcacagatcataaacaagaccga
                                                                               gggtagcccagatcataaacaaaacaga
                                                                               gagtcgcccagatcatcaacaaaactga
                                                                               gagtggcgcagatcatcattaaaaccga
                                                                               gagtggcgcagatcatcaataaaacgga
                                                                               gagtagcccagatcatcaacaagacgga
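
The cleanup described above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions that the question does not confirm: taxon names contain no spaces, each name is separated from its sequence by at least two spaces, and sequence blocks consist only of letters, `-` and `?`. The function names `clean_alignment` and `clean_directory`, the `*.phy` pattern, and the output directory are hypothetical and chosen for illustration. The sketch keeps the original padding between the name and the sequence rather than re-aligning the columns.

```python
import re
from pathlib import Path

# A single space with a sequence character (letter, '-' or '?') on both sides.
# Assumption: names are followed by two or more spaces, so that padding
# never matches and is left untouched.
BLOCK_GAP = re.compile(r"(?<=[A-Za-z?-]) (?=[A-Za-z?-])")


def clean_alignment(text: str) -> str:
    """Strip the leading whitespace of the first line and join 10-base blocks."""
    lines = text.splitlines()
    if not lines:
        return text
    # The header line (e.g. " 14 128") only loses its leading whitespace.
    cleaned = [lines[0].lstrip()]
    # Every other line has the single spaces between sequence blocks removed.
    cleaned.extend(BLOCK_GAP.sub("", line) for line in lines[1:])
    result = "\n".join(cleaned)
    # Preserve a trailing newline if the input had one.
    return result + "\n" if text.endswith("\n") else result


def clean_directory(src_dir: str, dst_dir: str, pattern: str = "*.phy") -> int:
    """Clean every matching file in src_dir into dst_dir; return the file count.

    The "*.phy" default is an assumption about how the files are named.
    """
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob(pattern)):
        (out / path.name).write_text(clean_alignment(path.read_text()))
        count += 1
    return count
```

Writing the results to a separate directory, for example `clean_directory("raw", "cleaned")`, leaves the 718 originals untouched so the output can be checked before anything is overwritten.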

0 answers:

No answers