Question

我正在分析基因组测序数据并遇到无法识别的问题。我正在使用包含大约500万个序列读取的输入fastq文件，如下所示：

1    Unique Header    #Read 1
2    AAAAAA.....AAAAAA    #Sequence Read 1
3    +
4    ??AA@F    #Quality of Read 1
5    Unique Header    #Read 2
6    ATTAAA.....AAAAAA
7    +
8    >>AA?B
9    Unique Header    #Read 3
10   ATAAAA.....AAAAAA
11   +
12   >>AA?B

然后想法迭代这个文件并比较序列读取行（上面第2行和第6行）。如果序列的第一个和最后六个字符足够唯一（Levenshtein距离为2），则将完整序列及其相应的三行写入输出文件。否则，它会被忽略。

当我使用一个小的测试文件时，我的代码似乎会这样做，但是当我然后分析一个完整的fastq文件时，似乎有两个很多的序列被写入输出文件。

我的代码如下，任何帮助将不胜感激。感谢

代码：

def outputFastqSimilar():
    target = open(output_file, 'w') #Final output file that will contain only matching acceptable reads/corresponding data

    with open(current_file, 'r') as f: #This is the input fastq
        lineCharsList = [] #Contains unique concatenated strings of first and last 6 chars of read line

        headerLine = next(f)  #Stores the header information for each line
        counter = 1

        for line in f:
            if counter == 1: 
                lineChars = line[0:6]+line[145:151] #Identify first concatenated string of first/last 6 chars
                lineCharsList.append(lineChars)

                #Write first read/info to output
                target.write(headerLine)
                target.write(line)
                nextLine = next(f)
                target.write(nextLine)
                nextLine = next(f)
                target.write(nextLine)
                headerLine = next(f)    #Move to next header
                counter+=1

            elif counter > 1:
                lineChars = line[0:6]+line[145:151] #Get first/last six chars from next read

                different_enough = True
                for i in lineCharsList: #Iterate through list and compare with current read
                    if distance(lineChars, i) < 2: #Levenshtein distance
                        different_enough = False
                        for skip in range(3): #If read too similar, skip over it
                            try:
                                check = line #Check for additional lines in file
                                headerLine = next(f) #Move to next header
                            except StopIteration:
                                break

                    elif distance(lineChars, i) >= dist_stringency & different_enough == True: #If read is unique enough, write to output
                        lineCharsList.append(lineChars)
                        target.write(headerLine)
                        target.write(line)
                        nextLine = next(f)
                        target.write(nextLine)
                        nextLine = next(f)
                        target.write(nextLine)
                        try:
                            check = line
                            headerLine = next(f)
                        except StopIteration:
                            break

    target.close()

测试文件的期望输出将是以下，其中所有读取都是唯一的，但是在线10上的读取具有Levenshtein距离＆lt; 2到第2行的读取因此不会包含在输出中：

1    Unique Header    #Read 1
2    AAAAAA.....AAAAAA    #Sequence Read 1
3    +
4    ??AA@F    #Quality of Read 1
5    Unique Header    #Read 2
6    ATTAAA.....AAAAAA
7    +
8    >>AA?B

Answer 1

您似乎正在测试每次阅读是否与任何之前的阅读不同，但您真正想要的是与所有以前的读物。

您可以在进入此循环之前设置标记Target: sparc-sun-solaris2.10 Configured with: /home/dam/mgar/pkg/gcc4/trunk/work/solaris10-sparc/build-isa-sparcv8plus/gcc-4.9.2/configure --prefix=/opt/csw --exec_prefix=/opt/csw --bindir=/opt/csw/bin --sbindir=/opt/csw/sbin --libexecdir=/opt/csw/libexec --datadir=/opt/csw/share --sysconfdir=/etc/opt/csw --sharedstatedir=/opt/csw/share --localstatedir=/var/opt/csw --libdir=/opt/csw/lib --infodir=/opt/csw/share/info --includedir=/opt/csw/include --mandir=/opt/csw/share/man --enable-cloog-backend=isl --enable-java-awt=xlib --enable-languages=ada,c,c++,fortran,go,java,objc --enable-libada --enable-libssp --enable-nls --enable-objc-gc --enable-threads=posix --program-suffix=-4.9 --with-cloog=/opt/csw --with-gmp=/opt/csw --with-included-gettext --with-ld=/usr/ccs/bin/ld --without-gnu-ld --with-libiconv-prefix=/opt/csw --with-mpfr=/opt/csw --with-ppl=/opt/csw --with-system-zlib=/opt/csw --with-as=/usr/ccs/bin/as --without-gnu-as Thread model: posix gcc version 4.9.2 (GCC)：different_enough = True

然后，当您测试for i in lineCharsList:是否将其设置为distance(lineChars, i) < 2时。

不要在循环内打印任何内容，等到它完成后再检查different_enough = False的状态。如果您的读取通过了每次比较，它仍然是True，所以打印出读取。如果即使一个读数太相似也会是假的。

这样，只有在通过每次比较时才会打印读数。

迭代行并在Python中进行比较

1 个答案: