psbki.fas:
>E_oleracea_Docas_de_Belm
AACCT
ycf1b.fas:
>E_oleracea_Docas_de_B
GGTTC
output:
>E_oleracea_Docas_de_Belm
AACCTGGTTC
如果您查看两个文件中的物种名称,它们是用一些语法问题编写的,因此彼此之间是不同的。另外,我还有另一个问题:两个文件中都没有某些物种。
为解决这些问题,我编写了以下代码:
ids, sequences = parse_fasta(open('psbki.fas', 'r').read().split('\n'))
ids2, sequences2 = parse_fasta(open('ycf1b.fas', 'r').read().split('\n'))
for i, j, z, h in zip(ids, sequences, sequences2, ids2):
if i != h:
print(">"+i + "\n"+j)
else:
print(">"+i + "\n"+j+z)
前两个序列的输出正常。但是对于其他序列,该代码仅打印一个文件中的文件,但它们同时在两个文件中。 我的代码有什么问题? 我是python的新手
输出为:
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAATTGGTATTATTCTCATTATCAGCATAAATTATCACACGTCTGGCTCTTCTTGAACGAATTTCAATATCTTCTATCGGTTTTTCCTCATTTTCTTCCTCCTGTTCTTCCAGAAGATTGGTCAATTTATATGACCATCGAGAAACCTTTTTACTGATTTCTTCTATTCCAATAGATTCATTTCTAGTTGTTTTATCATTTGGATCAATTGTCATTATATCGAATACAAATTTCAAAGATTTTGCTTGACTTTCTGAATCCATTTTTCTTTGTTCTGCCAATAAAGAACAGTTTTTCAAACAAAAATTGGGTGTGAATTCAAAAGAAAATGAAGTTAAGGAATTACCGATATAATTCAAAAATGATTTACCACCACCAAGTGAATTCTTTTGATGTTCAAATTCTCTGAAATTATTAGGAAGTAGCTCATGGATCTTATTTATCCAAAGACTTTTTATGGAATCCTCCATATAAGGGAAAAAATCATTTATGATTGTACGTAAATCAAAATCTTTTATTGCTCCACGGCATGGTCCGCTCAATAAAGGATCATATGTTTTGGTCAAGCATTTTTGTTTATTCTCATGATTGCAAAATCTAGTCTTTTTTTCGAGCATATCTAGAGCAAGAAATCCCTTTTCTTTTTTTTCTTTTTCTAGAGCTTTTATTCGACTTATTAATTCATTGCTCAAGTTGTATTTTTTTTGTTCATTGGTAAAAACCCAAAAATTATACAGGTCTCCATGGGATAATTTTTT-GTCGTGTACAAAAACATTTTTCGTTCTATCATTTCC
>E_edulis_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAATTGGTATTATTCTCATTATCAGCATAAATTATCACACGTCTGGCTCTTCTTGAACGAATTTCAATATCTTCTATCGGTTTTTCCTCAATTTCTTCCTCCTGTTCTTCCAGAAGATTGGTCAATTTATATGACCATCGAGAAACCTTTTTACTGATTTCTTCTATTCCAATAGATTCATTTCTAGTTGTTTTATCATTTGGATCAATTGTCATTATATCGAATACAAATTTCAAAGATTTTGCTTGACTTTCTGAATCCATTTTTCTTTGTTCTGCCAATAAAGAACAGTTTTTCAAACAAAAATTGGGTGTGAATTCAAAAGAAAATGAAGTTAAGGAATTACCGATATAATTCAAAAATGATTTACCACCACCAAGTGAATTCTTTTGATGTTCAAATTCTCTGAAATTATTAGGAAGTAGCTCATGGATCTTATTTATCCAAAGACTTTTTATGGAATCCTCCATATAAGGGAAAAAATCATTTATGATTGTACGTAAATCAAAATCTTTTATTGCTCCACGGCATGGTCCGCTCAATAAAGGATCATATGTTTTGGTCAAGCATTTTTGTTTATTCTCATGATTGCAAAATCTAGTCTTTTTTTCGAGCATATCTAGAGCAAGAAATCCCTTTTCTTTTTTTTCTTTTTCTAGAGCTTTTATTCGACTTATTAATTCATTGCTCAAGTTGTATTTTTTTTGTTCATTGGTAAAAACCCAAAAATTATACAGGTCTCCATGGGATAATTTTTTTGTCGTGTACAAAAACATTTTTCGTTCTATCATTTCC
>E_edulis_F7
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG
>E_edulis_R10
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_R11
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGKGTATGTGGTAAAGTAAAAAATAASTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_R12
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGWGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAA----ATCTTG
>E_edulis_IFES
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGARCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG
>E_oleracea_Ilha_do_combu_1
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_5
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_10
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Mangal_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Mangal_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Docas_de_Belm
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Utinga
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_1
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_IFES
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
换句话说,我想串联两个文件中的基因,而不打印,而不串联仅出现在一个文件中的物种。我不知道该如何解决物种最少写错误的问题。
编辑1: 我更改了代码,使用Levenshtein比率解决了某些物种名称中的书写错误,但输出是相同的。
新代码为:
import Levenshtein as lev
Str1 = str(ids)
Str2 = str(ids2)
Ratio = lev.ratio(Str1.lower(),Str2.lower())
for i, j, z, h in zip(ids, sequences, sequences2, ids2):
if lev.ratio(i,h) > 0.70 and i in h:
print(">"+i + "\n"+j+z)
else:
print(">"+i + "\n"+j)
编辑2
Input File1: gene 1
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_F7
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG
Input File 2: gene 2
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_ed_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
My desired output:
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
P.S。在第二个文件中,我具有相同的种类E_edulis_I2,但名称不完整-> E_ed_I2。想要脚本识别出该序列并将其与第一个序列连接(文件1 = E_edulis_I2)。另一个问题是E_edulis_F7硬币仅出现在文件1中,所以我不希望在输出中添加thar硬币。