Question

I have a FASTA file with headers like these:

612407518| Streptomyces sp. MJ635-86F5 DNA, cremimycin biosynthetic gene cluster, complete sequence
84617315| Streptomyces achromogenes subsp. rubradiris complete rubradirin biosynthetic gene cluster, strain NRRL 3061
345134845| Streptomyces sp. SN-593 DNA, reveromycin biosynthetic gene cluster, complete sequence
323700993| Streptomyces autulyticus strain CGMCC 0516 geldanamycin polyketide biosynthetic gene cluster, complete sequence
15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster
1408941746| Streptomyces sp. strain OUC6819 rdm biosynthetic gene cluster, complete sequence
315937014| Uncultured organism CA37 glycopeptide biosynthetic gene cluster, complete sequence
29122977| Streptomyces cinnamonensis polyether antibiotic monensin biosynthetic gene cluster, partial sequence
257129259| Moorea producens 19L curacin A biosynthetic gene cluster, partial sequence
166159347| Streptomyces sahachiroi azinomycin B biosynthetic gene cluster, partial sequence

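I want to keep only the single word that comes right before "biosynthetic gene cluster" in each header description, so that the result looks like this: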

 612407518|cremimycin
 84617315|rubradirin
 345134845|reveromycin
 323700993|polyketide
 15823967|oligomycin
 1408941746|rdm
 315937014|glycopeptide
 29122977|monensin
 257129259|curacin A
 166159347|azinomycin B

This is what I tried on my original file, which has more than 200 headers:

with open("test.txt") as f:
    for line in f:
        (id, name) = line.strip().split('|')
        term_list = name.split()
        term_index = term_list.index('biosynthetic') 

        term = term_list[int(term_index)-1]

        header = id + '|' + term
        print(header)

The result is fine, except that the last two headers in the example above come out like this:

257129259|A
166159347|B

I need to handle this second case, because my original data contains many headers of this kind.

Thanks for any comments.

Answer 1

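A simpler solution than a regular expression would be:

Split the string on "|" and assign the two parts to the variables id and s.
Split s into words.
Find the position of "biosynthetic" in the resulting list.
Verify that it is immediately followed by "gene" and "cluster".
Print id, followed by the word before "biosynthetic".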

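I deliberately did not write the code. If you make an attempt and add it to your question, others can respond and show you how to get it working (assuming you cannot do so yourself).

Good luck!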
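
For readers who want to compare against their own attempt, the five steps above can be sketched roughly as follows. The function name shorten_header is made up for illustration, and stripping a trailing comma from "cluster," is an assumption based on the sample headers in the question.

```python
def shorten_header(line):
    # Step 1: split on "|" into the ID and the description.
    seq_id, s = line.strip().split("|", 1)
    # Step 2: break the description into words.
    words = s.split()
    # Step 3: find the position of "biosynthetic" (ValueError if absent).
    i = words.index("biosynthetic")
    if i == 0:
        raise ValueError("no word before 'biosynthetic'")
    # Step 4: verify that "gene" and "cluster" follow immediately.
    # Trailing commas are stripped because some headers read "cluster,".
    following = [w.rstrip(",") for w in words[i + 1:i + 3]]
    if following != ["gene", "cluster"]:
        raise ValueError("'biosynthetic' is not followed by 'gene cluster'")
    # Step 5: the ID followed by the word before "biosynthetic".
    return seq_id + "|" + words[i - 1]
```

Note that this keeps exactly one word, so a header such as "curacin A biosynthetic gene cluster" still yields only "A", the same limitation as the attempt in the question.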

Answer 2

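No regular expression needed. If a header does not follow the expected format (that is, it always contains "biosynthetic gene cluster", always has a | marking the end of the ID, and always has a space before the wanted word), this raises a ValueError.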

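# header is one title line, e.g. "15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster"
id = header[:header.index("|")+1]  # the ID, up to and including the "|"
end = header.index(" biosynthetic gene cluster")  # where the key phrase starts
word = header[header[:end].rindex(" ")+1:end]  # last word before the key phrase
new_title = id + word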
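
To show the ValueError behaviour mentioned above, here is a small sketch that wraps the same slicing in a helper and applies it to two headers. The helper name shorten and the second, malformed header are invented for the example; recording None for a bad header is just one possible choice.

```python
def shorten(header):
    # Same slicing as in the answer, wrapped in a hypothetical helper.
    prefix = header[:header.index("|") + 1]
    end = header.index(" biosynthetic gene cluster")
    word = header[header[:end].rindex(" ") + 1:end]
    return prefix + word

headers = [
    "15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster",
    "99999999| a made-up header without the key phrase",
]

results = []
for header in headers:
    try:
        results.append(shorten(header))
    except ValueError:
        # str.index and str.rindex raise ValueError when the substring is missing.
        results.append(None)

print(results)
```

Catching the exception per header lets a 200-line file be processed in one pass while the malformed lines are collected for inspection.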

Answer 3

You can use Python's str.split() method to get the number up to the pipe delimiter.

To grab the word in front of that phrase, you may need a regular-expression lookahead.
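
A minimal sketch of that combination, assuming the key phrase is always spelled " biosynthetic gene cluster". It uses a positive lookahead, (?=...), since that is the form that matches a word only when the phrase follows it; the variable names are illustrative.

```python
import re

header = ("345134845| Streptomyces sp. SN-593 DNA, reveromycin "
          "biosynthetic gene cluster, complete sequence")

# str.split() with maxsplit=1 separates the ID from the description.
seq_id, description = header.split("|", 1)

# \w+ followed by a positive lookahead: the phrase must come next,
# but it is not included in the match itself.
match = re.search(r"\w+(?= biosynthetic gene cluster)", description)

new_title = seq_id + "|" + match.group(0) if match else None
print(new_title)
```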

Answer 4

Try a regular expression: reg = re.match(r'(\d+)\|.* (\w+) biosynthetic gene cluster', txt), then you can use reg.group(1) and reg.group(2).
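
As written, that pattern captures a single word, so "curacin A" would still come out as "A". One possible extension, sketched below, optionally keeps a trailing single capital letter. This rests on the assumption that two-part names in the data always look like "name X"; the names ONE_WORD, WORD_PLUS_LETTER and shorten are made up for the example.

```python
import re

# Pattern from the answer above: one word before the key phrase.
ONE_WORD = re.compile(r"(\d+)\|.* (\w+) biosynthetic gene cluster")

# Assumed extension: optionally keep one trailing capital letter ("curacin A").
# The lazy .*? matters here: a greedy .* would swallow "curacin" and leave "A".
WORD_PLUS_LETTER = re.compile(
    r"(\d+)\|.*? (\w+(?: [A-Z])?) biosynthetic gene cluster"
)

def shorten(header, pattern=WORD_PLUS_LETTER):
    m = pattern.match(header)
    if m is None:
        return None  # header does not fit the expected format
    return m.group(1) + "|" + m.group(2)

curacin = ("257129259| Moorea producens 19L curacin A "
           "biosynthetic gene cluster, partial sequence")
print(shorten(curacin, ONE_WORD))
print(shorten(curacin))
```

Names that do not fit the assumed shape (for example, a suffix longer than one letter) would need a different pattern.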

Keep a substring of a long string in Python?
