Error when reading BLAST output in CSV format with Python

Date: 2014-05-14 19:08:26

Tags: python list csv blast

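Apologies for the long question. I have been trying to solve this for a while, but I cannot work out what I am doing wrong. I have included a sample of the data so you can see what I am working with.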

The output of my BLAST search looks like this:

# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne                                           
# Database: SANfive                                         
# Fields: query id   subject id  % identity  alignment length    mismatches  gap opens   q. start    q. end  s. start    s. end  evalue  bit score
# 7 hits found                                          
Cryptocephalus  M00964:19:000000000-A4YV1:1:2110:23842:21326    99.6    250 1   0   125 374 250 1   1.00E-128   457
Cryptocephalus  M00964:19:000000000-A4YV1:1:1112:19704:18005    85.37   246 36  0   90  335 246 1   4.00E-68    255
Cryptocephalus  M00964:19:000000000-A4YV1:1:2106:14369:15227    77.42   248 50  3   200 444 245 1   3.00E-34    143
Cryptocephalus  M00964:19:000000000-A4YV1:1:2102:5533:11928 78.1    137 30  0   3   139 114 250 2.00E-17    87.9
Cryptocephalus  M00964:19:000000000-A4YV1:1:1110:28729:12868    81.55   103 19  0   38  140 104 2   6.00E-17    86.1
Cryptocephalus  M00964:19:000000000-A4YV1:1:1113:11427:16440    78.74   127 27  0   3   129 124 250 6.00E-17    86.1
Cryptocephalus  M00964:19:000000000-A4YV1:1:2110:12170:20594    78.26   115 25  0   3   117 102 216 1.00E-13    75
# BLASTN 2.2.29+                                            
# Query: Cryptocephalus aureolus                                            
# Database: SANfive                                         
# Fields: query id   subject id  % identity  alignment length    mismatches  gap opens   q. start    q. end  s. start    s. end  evalue  bit score
# 10 hits found                                         
Cryptocephalus  M00964:19:000000000-A4YV1:1:2111:20990:19930    97.2    250 7   0   119 368 250 1   1.00E-118   424
Cryptocephalus  M00964:19:000000000-A4YV1:1:1105:20676:23942    86.89   206 27  0   5   210 209 4   7.00E-61    231
Cryptocephalus  M00964:19:000000000-A4YV1:1:1113:6534:23125 97.74   133 3   0   1   133 133 1   3.00E-60    230
Cryptocephalus  M00964:21:000000000-A4WJV:1:2104:11955:19015    89.58   144 15  0   512 655 1   144 2.00E-46    183
Cryptocephalus  M00964:21:000000000-A4WJV:1:1109:14814:10240    88.28   128 15  0   83  210 11  138 2.00E-37    154
Cryptocephalus  M00964:21:000000000-A4WJV:1:1105:4530:13833 79.81   208 42  0   3   210 211 4   6.00E-37    152
Cryptocephalus  M00964:19:000000000-A4YV1:1:2108:13133:14967    98.7    77  1   0   1   77  77  1   2.00E-32    137
Cryptocephalus  M00964:19:000000000-A4YV1:1:1109:14328:3682 100 60  0   0   596 655 251 192 1.00E-24    111
Cryptocephalus  M00964:19:000000000-A4YV1:1:1105:19070:25181    100 53  0   0   1   53  53  1   8.00E-21    99
Cryptocephalus  M00964:19:000000000-A4YV1:1:1109:20848:27419    100 28  0   0   1   28  28  1   6.00E-07    52.8
# BLASTN 2.2.29+                                            
# Query: Cryptocephalus cynarae                                         
# Database: SANfive                                         
# Fields: query id   subject id  % identity  alignment length    mismatches  gap opens   q. start    q. end  s. start    s. end  evalue  bit score
# 2 hits found                                          
Cryptocephalus  M00964:21:000000000-A4WJV:1:2107:12228:15885    90.86   175 16  0   418 592 4   178 5.00E-62    235
Cryptocephalus  M00964:21:000000000-A4WJV:1:1110:20463:5044 84.52   168 26  0   110 277 191 24  2.00E-41    167

I saved this as a CSV file, shown again below:

# BLASTN 2.2.29+,,,,,,,,,,,
# Query: Cryptocephalus androgyne,,,,,,,,,,,
# Database: SANfive,,,,,,,,,,,
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 7 hits found,,,,,,,,,,,
Cryptocephalus,M00964:19:000000000-A4YV1:1:2110:23842:21326,99.6,250,1,0,125,374,250,1,1.00E-128,457
Cryptocephalus,M00964:19:000000000-A4YV1:1:1112:19704:18005,85.37,246,36,0,90,335,246,1,4.00E-68,255
Cryptocephalus,M00964:19:000000000-A4YV1:1:2106:14369:15227,77.42,248,50,3,200,444,245,1,3.00E-34,143
Cryptocephalus,M00964:19:000000000-A4YV1:1:2102:5533:11928,78.1,137,30,0,3,139,114,250,2.00E-17,87.9
Cryptocephalus,M00964:19:000000000-A4YV1:1:1110:28729:12868,81.55,103,19,0,38,140,104,2,6.00E-17,86.1
Cryptocephalus,M00964:19:000000000-A4YV1:1:1113:11427:16440,78.74,127,27,0,3,129,124,250,6.00E-17,86.1
Cryptocephalus,M00964:19:000000000-A4YV1:1:2110:12170:20594,78.26,115,25,0,3,117,102,216,1.00E-13,75
# BLASTN 2.2.29+,,,,,,,,,,,
# Query: Cryptocephalus aureolus,,,,,,,,,,,
# Database: SANfive,,,,,,,,,,,
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 10 hits found,,,,,,,,,,,
Cryptocephalus,M00964:19:000000000-A4YV1:1:2111:20990:19930,97.2,250,7,0,119,368,250,1,1.00E-118,424
Cryptocephalus,M00964:19:000000000-A4YV1:1:1105:20676:23942,86.89,206,27,0,5,210,209,4,7.00E-61,231
Cryptocephalus,M00964:19:000000000-A4YV1:1:1113:6534:23125,97.74,133,3,0,1,133,133,1,3.00E-60,230
Cryptocephalus,M00964:21:000000000-A4WJV:1:2104:11955:19015,89.58,144,15,0,512,655,1,144,2.00E-46,183
Cryptocephalus,M00964:21:000000000-A4WJV:1:1109:14814:10240,88.28,128,15,0,83,210,11,138,2.00E-37,154
Cryptocephalus,M00964:21:000000000-A4WJV:1:1105:4530:13833,79.81,208,42,0,3,210,211,4,6.00E-37,152
Cryptocephalus,M00964:19:000000000-A4YV1:1:2108:13133:14967,98.7,77,1,0,1,77,77,1,2.00E-32,137
Cryptocephalus,M00964:19:000000000-A4YV1:1:1109:14328:3682,100,60,0,0,596,655,251,192,1.00E-24,111
Cryptocephalus,M00964:19:000000000-A4YV1:1:1105:19070:25181,100,53,0,0,1,53,53,1,8.00E-21,99
Cryptocephalus,M00964:19:000000000-A4YV1:1:1109:20848:27419,100,28,0,0,1,28,28,1,6.00E-07,52.8

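I wrote a short script that goes through the percent identity column and, if the value is above a threshold, takes the query ID and adds it to a list, then removes duplicates from that list.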

import csv
from pylab import plot,show

#Making a function to see if a string is a number or not
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

#Importing the CSV file, using sniffer to check the delimiters used 
#In the first 1024 bytes

ImportFile = raw_input("What is the name of your import file? ")
csvfile = open(ImportFile, "rU")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)

#Finding species over 98%

Species98 = []
Species95to97 = []
Species90to94 = []
Species85to89 = []
Species80to84 = []
Species75to79 = []
SpeciesBelow74 = []



for line in reader:
    if is_number(line[2])== True:
        if float(line[2])>=98:
            Species98.append(line[0])
        elif 97>=float(line[2])>=95:
            Species95to97.append(line[0])
        elif 94>=float(line[2])>=90:
            Species90to94.append(line[0])
        elif 89>=float(line[2])>=85:
            Species85to89.append(line[0])
        elif 84>=float(line[2])>=80:
            Species80to84.append(line[0])
        elif 79>=float(line[2])>=75:
            Species75to79.append(line[0])
        elif float(line[2])<=74:
            SpeciesBelow74.append(line[0])

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [ x for x in seq if x not in seen and not seen_add(x)]


Species98=f7(Species98)
print len(Species98), "species over 98"

Species95to97=f7(Species95to97) #removing duplicates
search_set = set().union(Species98)
Species95to97 = [x for x in Species95to97 if x not in search_set]
print len(Species95to97), "species between 95-97"

Species90to94=f7(Species90to94)
search_set = set().union(Species98, Species95to97)
Species90to94 = [x for x in Species90to94 if x not in search_set]
print len(Species90to94), "species between 90-94"

Species85to89=f7(Species85to89)
search_set = set().union(Species98, Species95to97, Species90to94)
Species85to89 = [x for x in Species85to89 if x not in search_set]               
print len(Species85to89), "species between 85-89"

Species80to84=f7(Species80to84)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89)
Species80to84 = [x for x in Species80to84 if x not in search_set]               
print len(Species80to84), "species between 80-84"

Species75to79=f7(Species75to79)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89,Species80to84)
Species75to79 = [x for x in Species75to79 if x not in search_set]       
print len(Species75to79), "species between 75-79"

SpeciesBelow74=f7(SpeciesBelow74)
search_set = set().union(Species98, Species95to97, Species90to94, Species85to89,Species80to84, Species75to79)
SpeciesBelow74 = [x for x in SpeciesBelow74 if x not in search_set] 
print len(SpeciesBelow74), "species below 74"

#Finding species 95-97%
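
A side note on the threshold chain in the loop above: the integer bounds leave gaps, so an identity such as 97.2 or 89.58 (both present in the sample data) matches none of the branches and is silently dropped. Below is a minimal sketch of gap-free binning, assuming half-open bins were intended. The names `EDGES`, `LABELS` and `bin_label` are illustrative and not part of the original script.

```python
from bisect import bisect_right

# Lower edges of the bins in ascending order (assumed boundaries).
EDGES = [75, 80, 85, 90, 95, 98]
# One label per bin; there is always one more bin than there are edges.
LABELS = ["below 75", "75-80", "80-85", "85-90", "90-95", "95-98", "98 and over"]

def bin_label(identity):
    """Return the label of the half-open bin [low, high) containing identity."""
    # bisect_right counts how many edges are <= identity, which is the bin index.
    return LABELS[bisect_right(EDGES, identity)]

for value in (97.2, 89.58, 98.0, 74.9):
    print(value, bin_label(value))
```

With these edges a value of exactly 98 lands in the top bin and 74.9 in the bottom one. Adjust `EDGES` if different boundaries are wanted.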

The script runs fine most of the way through, but every time it fails with the following error:

File "FindingSpeciesRepresentation.py", line 35, in <module>
    if is_number(line[2])== "True":
IndexError: list index out of range

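However, if I change the script so that it prints line[2], it prints all the identities I expect. Do you know what could be going wrong? Apologies again for the wall of data.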
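
One plausible cause, which the post does not confirm, is a row with fewer than three fields. `csv.reader` returns an empty list for a blank line, and a comment line saved without trailing commas becomes a one-item list, so `line[2]` raises `IndexError` only once the loop reaches such a row. That would also explain why every expected identity is printed first. The sketch below uses Python 3 and a hypothetical sample. The last two lines of `SAMPLE` are assumptions about what the end of the real file might contain, and `identities` is an illustrative helper name.

```python
import csv
import io

# Hypothetical sample: two hit rows taken from the question, followed by a
# comment line without trailing commas and a blank line (both assumed).
SAMPLE = (
    "# 2 hits found,,,,,,,,,,,\n"
    "Cryptocephalus,M00964:19:000000000-A4YV1:1:2110:23842:21326,99.6,250,1,0,125,374,250,1,1.00E-128,457\n"
    "Cryptocephalus,M00964:19:000000000-A4YV1:1:1112:19704:18005,85.37,246,36,0,90,335,246,1,4.00E-68,255\n"
    "# BLAST processed 1 queries\n"
    "\n"
)

def identities(rows):
    """Yield (query_id, percent_identity) for rows that look like hit lines."""
    for row in rows:
        # Blank lines arrive as [] and short comment lines as 1-item lists,
        # so row[2] would raise IndexError without this length check.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        try:
            yield row[0], float(row[2])
        except ValueError:
            # Third field is not numeric, e.g. a header-like row.
            continue

reader = csv.reader(io.StringIO(SAMPLE))
hits = list(identities(reader))
print(hits)
```

Printing `len(line)` and `line` just before the failing statement in the original script would show whether a short row is really what triggers the error.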

This follows on from my earlier question: Extracting BLAST output columns in CSV form with python

0 answers:

No answers yet.