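I am analyzing genome sequencing data and have run into a problem I cannot identify. My input is a fastq file containing about 5 million sequence reads, which looks like this: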
1 Unique Header #Read 1
2 AAAAAA.....AAAAAA #Sequence Read 1
3 +
4 ??AA@F #Quality of Read 1
5 Unique Header #Read 2
6 ATTAAA.....AAAAAA
7 +
8 >>AA?B
9 Unique Header #Read 3
10 ATAAAA.....AAAAAA
11 +
12 >>AA?B
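The idea is to iterate through this file and compare the sequence read lines (lines 2 and 6 above). If the first and last six characters of a sequence are unique enough (a Levenshtein distance of at least 2), the full sequence and its three corresponding lines are written to an output file. Otherwise, the read is ignored.
My code appears to do this when I use a small test file, but when I analyze a full fastq file, far too many sequences seem to be written to the output file.
My code is below; any help would be much appreciated. Thanks.
Code: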
def outputFastqSimilar():
    target = open(output_file, 'w') #Final output file that will contain only matching acceptable reads/corresponding data
    with open(current_file, 'r') as f: #This is the input fastq
        lineCharsList = [] #Contains unique concatenated strings of first and last 6 chars of read line
        headerLine = next(f) #Stores the header information for each line
        counter = 1
        for line in f:
            if counter == 1:
                lineChars = line[0:6]+line[145:151] #Identify first concatenated string of first/last 6 chars
                lineCharsList.append(lineChars)
                #Write first read/info to output
                target.write(headerLine)
                target.write(line)
                nextLine = next(f)
                target.write(nextLine)
                nextLine = next(f)
                target.write(nextLine)
                headerLine = next(f) #Move to next header
                counter+=1
            elif counter > 1:
                lineChars = line[0:6]+line[145:151] #Get first/last six chars from next read
                different_enough = True
                for i in lineCharsList: #Iterate through list and compare with current read
                    if distance(lineChars, i) < 2: #Levenshtein distance
                        different_enough = False
                        for skip in range(3): #If read too similar, skip over it
                            try:
                                check = line #Check for additional lines in file
                                headerLine = next(f) #Move to next header
                            except StopIteration:
                                break
                    elif distance(lineChars, i) >= dist_stringency & different_enough == True: #If read is unique enough, write to output
                        lineCharsList.append(lineChars)
                        target.write(headerLine)
                        target.write(line)
                        nextLine = next(f)
                        target.write(nextLine)
                        nextLine = next(f)
                        target.write(nextLine)
                        try:
                            check = line
                            headerLine = next(f)
                        except StopIteration:
                            break
    target.close()
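The desired output for the test file would be the following. All reads have unique headers, but the read on line 10 has a Levenshtein distance < 2 to the read on line 2 and is therefore not included in the output: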
1 Unique Header #Read 1
2 AAAAAA.....AAAAAA #Sequence Read 1
3 +
4 ??AA@F #Quality of Read 1
5 Unique Header #Read 2
6 ATTAAA.....AAAAAA
7 +
8 >>AA?B
Answer 0 (score: 1)
You seem to be testing whether each read differs from any previous read, but what you really want is for it to differ from all previous reads.
You can set a flag before entering the loop that starts with for i in lineCharsList: by assigning different_enough = True.
Then, inside the loop, when the test distance(lineChars, i) < 2 succeeds, set different_enough = False.
Do not write anything out inside the loop. Wait until it has finished and then check the state of different_enough. If your read passed every comparison, the flag is still True, so write out the read. If even one earlier read was too similar, it will be False.
This way, a read is written out only if it passes every comparison.
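The flag approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the asker's actual code: the names levenshtein, filter_fastq, min_distance and end_len are hypothetical, the pure-Python levenshtein function stands in for the distance() call in the question (whose import is not shown), the key is built from the first and last six characters of the stripped sequence rather than the fixed slice 145:151, and the function takes an iterable of lines instead of opening files.

```python
from itertools import islice


def levenshtein(a, b):
    """Edit distance by dynamic programming (stand-in for the question's distance())."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def filter_fastq(lines, min_distance=2, end_len=6):
    """Keep only the 4-line records whose end key differs from ALL kept keys."""
    lines = iter(lines)
    seen_keys = []   # keys of the reads accepted so far
    kept = []        # output lines, four per accepted read
    while True:
        record = list(islice(lines, 4))   # header, sequence, '+', quality
        if len(record) < 4:
            break                         # end of input or truncated record
        seq = record[1].rstrip("\n")
        key = seq[:end_len] + seq[-end_len:]

        # Set the flag before the comparison loop ...
        different_enough = True
        for other in seen_keys:
            if levenshtein(key, other) < min_distance:
                different_enough = False  # one close match is enough to reject
                break

        # ... and act on it only after every comparison has been made.
        if different_enough:
            seen_keys.append(key)
            kept.extend(record)
    return kept
```

Reading four lines at a time keeps the header, sequence, separator and quality line together, so a rejected read is skipped as a unit and nothing is written from inside the comparison loop. Passing an open file object as lines and writing the returned list with writelines would be one way to use it. Note that each read is compared against every accepted key, which may be slow on millions of reads.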