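How can I delete lines in a file that start with the same 5 characters, when those 5 characters are random (I don't know in advance what they will be)?
I have code that reads the last 5 characters of the first line of a file and matches them against the FIRST 5 characters of another line somewhere in the file that starts with the same 5 characters. The problem is that when two or more candidate lines share the same first 5 characters, the code gets confused. I need to read every line in the file and, whenever two lines start with the same 5 characters, delete one of them.
Example (the problem; the asterisks only highlight the matching characters):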
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT
What I need after one of the duplicates has been removed from the file:
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
(no third line)
I would appreciate it if you could explain in words how to do this.
Answer 0 (score: 0)
For example, you could do it like this:
FILE_NAME = "data.txt"  # name of the file to read
NR_MATCHING_CHARS = 5  # number of leading characters that must match

lines = set()  # prefixes of the lines that have already been printed
with open(FILE_NAME, "r") as inF:  # open the file
    for line in inF:  # for every line
        line = line.strip()  # drop the newline and surrounding whitespace
        if line == "":
            continue  # skip empty lines
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if beginOfSequence not in lines:  # this prefix has not been printed yet
            print(line)  # print the line
            lines.add(beginOfSequence)  # remember the prefix so later duplicates are skipped
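
In words: walk through the file once, keep a set of the 5-character prefixes seen so far, and keep a line only if its prefix is not in that set yet. If it helps, the same idea can be wrapped in a function that returns the kept lines instead of printing them. This is only a sketch: the name `dedupe_by_prefix` and the `prefix_len` parameter are made up for illustration, and the sample data assumes the asterisks in the question are highlighting, not part of the file.

```python
def dedupe_by_prefix(lines, prefix_len=5):
    """Return the lines, keeping only the first one seen for each prefix.

    `dedupe_by_prefix` and `prefix_len` are hypothetical names for this sketch.
    """
    seen = set()   # prefixes that already have a kept line
    kept = []      # lines to keep, in their original order
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore empty lines
        prefix = line[:prefix_len]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Sample lines from the question, with the highlighting asterisks removed.
    sample = [
        "CCTGGATGGCTTATATAAGATGTTAT",
        "GTTATATAATATACCACCGGGCTGCTT",
        "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT",
    ]
    for line in dedupe_by_prefix(sample):
        print(line)
```

Because a file object yields its lines when iterated, you could pass an open file straight in, for example `with open("data.txt") as f: kept = dedupe_by_prefix(f)`, and then write `kept` to a new file. Note that this always keeps the first line with a given prefix and drops every later one.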