I have generated an edited DNA sequencing file with one read per line, and I want to remove every read that matches another line to within one character.
Input file:
AAAAAAAAAAAA #Start checking at line 1
TTTTTTTTTTTT #Diff by >1 char: Keep
AAAAACAAAAAA #Diff by 1 char: Delete
AAAAACAAACAA #Diff by 2 char: Keep
AAAAAAAAAAAA #Diff by <1 char: Delete
Output file:
AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA
What I have so far:
with open(current_file, 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[:]
        if not (lineChars in lineCharsList):  # only matches lines exactly; partial matching is needed
            lineCharsList.append(lineChars)
            outLines.append(line)
            print(line)
Answer 0 (score: 2)
Install python-levenshtein:
pip install python-levenshtein
and use the function Levenshtein.hamming to compare the strings.
hamming(string1, string2)
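Computes the Hamming distance between two strings. The Hamming distance is simply the number of positions at which the characters differ, which means the strings must have the same length.
Examples: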
>>> hamming('Hello world!', 'Holly grail!')
7
>>> hamming('Brian', 'Jesus')
5
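If installing the package is not an option, the definition above can be sketched in plain Python. The function below is an illustrative stand-in, not the library's implementation; raising ValueError on unequal lengths is a choice made here to mirror the same-length requirement stated above.

```python
def hamming(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # Choice made for this sketch: reject strings of unequal length.
        raise ValueError("strings must have the same length")
    # Each comparison yields True (1) or False (0), so the sum is the
    # number of mismatching positions.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming("Hello world!", "Holly grail!"))  # 7
print(hamming("AAAAAAAAAAAA", "AAAAACAAAAAA"))  # 1
```

For the reads in the question, a distance below 2 marks a line as a near-duplicate.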
The code is:
import Levenshtein

input_lines = [
    "AAAAAAAAAAAA",
    "TTTTTTTTTTTT",  # Diff by >1 char: Keep
    "AAAAACAAAAAA",  # Diff by 1 char: Delete
    "AAAAACAAACAA",  # Diff by 2 char: Keep
    "AAAAAAAAAAAA",  # Diff by <1 char: Delete
]
output_lines = []
for current_line in input_lines:
    for previous_line in output_lines:
        if Levenshtein.hamming(previous_line, current_line) < 2:
            break
    else:
        output_lines.append(current_line)
print('\n'.join(output_lines))
Output:
AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA
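The same loop can be adapted to lines read from a file, as in the question. The following is a minimal sketch under a few assumptions: the helper names within_distance and filter_reads are hypothetical, trailing newlines are stripped before comparing, blank lines are skipped, and reads of different lengths are never treated as near-duplicates. The comparison stops as soon as the allowed number of mismatches is exceeded.

```python
def within_distance(a, b, max_diff=1):
    """Return True if a and b have equal length and differ in at most max_diff positions."""
    if len(a) != len(b):
        return False  # assumption: unequal lengths never count as near-duplicates
    diffs = 0
    for x, y in zip(a, b):
        if x != y:
            diffs += 1
            if diffs > max_diff:
                return False  # stop early, the reads are already too different
    return True


def filter_reads(lines, max_diff=1):
    """Keep each read that is not within max_diff of an already kept read."""
    kept = []
    for line in lines:
        read = line.strip()  # drop the trailing newline
        if not read:
            continue  # skip blank lines
        for previous in kept:
            if within_distance(previous, read, max_diff):
                break  # near-duplicate of a kept read: drop it
        else:
            kept.append(read)  # no kept read was close enough
    return kept


sample = [
    "AAAAAAAAAAAA\n",
    "TTTTTTTTTTTT\n",
    "AAAAACAAAAAA\n",
    "AAAAACAAACAA\n",
    "AAAAAAAAAAAA\n",
]
print("\n".join(filter_reads(sample)))
```

Because filter_reads accepts any iterable of lines, an open file object can be passed directly, for example filter_reads(f) inside a with open(current_file) as f: block.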
Answer 1 (score: 1)
You have already received a good answer.
Here is my implementation in plain Python:
with open(current_file, 'r') as f:
    outlines = []
    for line in f:
        z = zip(line, *[el for el in outlines])
        matches = [el[0] in el[1:] for el in z]
        if matches.count(False) > 1:
            outlines.append(line)
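One caveat worth checking: the snippet above tests each column against all kept lines at once, so a read can be dropped when every one of its characters appears in some kept line at that position, even if it differs from each individual kept line by more than one character. The sketch below wraps the same logic in a hypothetical function, columnwise_filter, operating on a list of strings, and runs it on a small made-up example.

```python
def columnwise_filter(reads):
    """Same column-wise logic as the snippet above, applied to a list of strings."""
    kept = []
    for read in reads:
        columns = zip(read, *kept)
        # True where the read's character equals that of ANY kept read in this column
        matches = [col[0] in col[1:] for col in columns]
        if matches.count(False) > 1:
            kept.append(read)
    return kept


reads = ["AAAA", "TTTT", "AATT"]  # made-up reads for illustration
print(columnwise_filter(reads))  # ['AAAA', 'TTTT']

# "AATT" differs from each kept read in two positions, yet it was dropped.
distances = [sum(a != b for a, b in zip("AATT", k)) for k in ["AAAA", "TTTT"]]
print(distances)  # [2, 2]
```

Comparing the new read against each kept line separately, as the Hamming-distance loop in the other answer does, avoids this.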