How do I delete lines that start with the same (but random) characters in Python?

Asked: 2018-11-15 20:09:41

Tags: python bioinformatics matching dna-sequence

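I am trying to delete lines in a file that start with the same 5 characters, but those first 5 characters are random (I don't know in advance what they will be).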

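I have code that reads the last 5 characters of the first line of the file and matches them against the FIRST 5 characters of another line somewhere in the file. The problem is that the code gets confused when two or more candidate lines share the same first 5 characters. I need to read all the lines in the file and remove one of any two lines that have the same first 5 characters.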

Example (the problem):

CCTGGATGGCTTATATAAGAT***GTTAT***

***GTTAT***ATAATATACCACCGGGCTGCTT

***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT

What I need after one of them is removed from the file:

CCTGGATGGCTTATATAAGAT***GTTAT***

***GTTAT***ATAATATACCACCGGGCTGCTT

(no third line)

I would appreciate it if you could explain in words how I can do this.

1 Answer:

Answer 0 (score: 0)

For example, you could do it like this:

FILE_NAME = "data.txt"                       # name of the file to read
NR_MATCHING_CHARS = 5                        # number of leading characters that must match

seenPrefixes = set()                         # prefixes of the lines that have already been printed
with open(FILE_NAME, "r") as inF:            # open the file for reading
    for line in inF:                         # go through every line
        line = line.strip()                  # drop the newline and surrounding whitespace
        if line == "":                       # skip empty lines
            continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if beginOfSequence not in seenPrefixes:   # no earlier line started with this prefix
            print(line)                           # print the line
            seenPrefixes.add(beginOfSequence)     # remember the prefix so later duplicates are skipped
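
The same idea can also be wrapped in a function that takes any iterable of lines, which makes it easier to test or to write the result to a new file instead of printing it. This is only a minimal sketch, not part of the original answer: the names `drop_duplicate_prefixes` and `prefix_len` are hypothetical, and the sample data is the example from the question with the emphasis markers removed.

```python
def drop_duplicate_prefixes(lines, prefix_len=5):
    """Return the non-empty lines, keeping only the first line seen for each prefix.

    Hypothetical helper: `lines` is any iterable of strings and
    `prefix_len` is the number of leading characters that must match.
    """
    seen = set()   # prefixes of the lines kept so far
    kept = []      # lines that survive, in their original order
    for line in lines:
        line = line.strip()
        if not line:            # skip empty lines
            continue
        prefix = line[:prefix_len]
        if prefix not in seen:  # first line with this prefix: keep it
            seen.add(prefix)
            kept.append(line)
    return kept


# Sample data taken from the question (emphasis markers removed).
sample = [
    "CCTGGATGGCTTATATAAGATGTTAT",
    "GTTATATAATATACCACCGGGCTGCTT",
    "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT",
]
result = drop_duplicate_prefixes(sample)
print("\n".join(result))
```

This keeps whichever line with a given prefix appears first and drops the later ones. A line shorter than `prefix_len` is compared on its full text, because slicing past the end of a string simply returns the whole string.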