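I have recently been learning some Python and how to apply it to my work. I have successfully written a few scripts, but I have run into a problem I cannot figure out.
The file I am opening has about 4000 lines, each with two tab-separated columns. While reading the input file, I get an IndexError saying the list index is out of range. However, although I get the error every time, it is thrown on a different line each time. So for some reason the code works for most lines, but then (seemingly) randomly fails.
Since I only started learning Python last week, I am stumped. I have searched for the same problem but have not found anything similar. I also do not know whether this is a problem with the language or with IPython. Any help would be much appreciated!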
input = open("count.txt", "r")
changelist = []
listtosort = []
second = str()
output = open("output.txt", "w")
for each in input:
    splits = each.split("\t")
    changelist = list(splits[0])
    second = int(splits[1])
    print second
    if changelist[7] == ";":
        changelist.insert(6, "000")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    elif changelist[8] == ";":
        changelist.insert(6, "00")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    elif changelist[9] == ";":
        changelist.insert(6, "0")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    else:
        #output.write(str("".join(changelist)))
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
output.close()
Error:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/a/Desktop/sharedfolder/ipytest/individ.ins.count.test/<ipython-input-87-32f9b0a1951b> in <module>()
     57     splits = each.split("\t")
     58     changelist = list(splits[0])
---> 59     second = int(splits[1])
     60
     61     print second
IndexError: list index out of range
Input:
ID=cds0;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
Desired output:
ID=cds0000;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
Answer 0 (score: 0)
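This happens when a line in count.txt does not contain a tab character. When you split such a line on the tab, the resulting list has no splits[1], hence the "list index out of range" error.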
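The behaviour can be reproduced in isolation. This is a minimal sketch written for Python 3, using made-up sample strings rather than lines from the real count.txt:

```python
# Made-up sample lines; the real file contents may differ.
good_line = "ID=cds0;Name=NP_414542.1\t12\n"
bad_line = "ID=cds0;Name=NP_414542.1 12\n"  # a space instead of a tab

good_splits = good_line.split("\t")
bad_splits = bad_line.split("\t")

print(len(good_splits))  # 2
print(len(bad_splits))   # 1

# With only one element in the list, index 1 does not exist.
try:
    bad_splits[1]
except IndexError as exc:
    print("IndexError:", exc)
```

An empty line at the end of the file behaves the same way, because splitting it on a tab also yields a single-element list.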
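To find out which line causes the error, add print(each) right after the splits assignment on line 57. The line printed just before the error message is your culprit. If your input file keeps changing, the error will show up at a different position each time. Change your script so that it handles these malformed lines.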
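One possible way to handle malformed lines is to collect them instead of crashing. The sketch below is written for Python 3; the helper name split_rows and the sample data are hypothetical, and it assumes a well-formed line has exactly two tab-separated fields with an integer in the second:

```python
def split_rows(lines):
    """Separate well-formed two-column rows from malformed lines.

    Returns (rows, malformed): rows holds (first_column, count) tuples,
    malformed holds (line_number, raw_line) pairs for later inspection.
    """
    rows, malformed = [], []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            # No tab (or too many tabs): remember the line and move on.
            malformed.append((line_number, line))
            continue
        try:
            count = int(fields[1])
        except ValueError:
            # The second column is not an integer.
            malformed.append((line_number, line))
            continue
        rows.append((fields[0], count))
    return rows, malformed


# Made-up sample data: one good line, one empty line, one with a space
# where the tab should be.
sample = ["ID=cds0;gbkey=CDS\t12\n", "\n", "ID=cds1000;gbkey=CDS 50\n"]
rows, malformed = split_rows(sample)
for line_number, line in malformed:
    print("line {}: {!r}".format(line_number, line))
```

With the real file, the open file object can be passed in place of the sample list, since iterating over a file yields its lines.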
Answer 1 (score: 0)
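The reason you are getting the IndexError is that your input file is apparently not entirely tab-delimited. That is why there is nothing at splits[1] when you try to access it.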
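Your code could use some refactoring. First, you are repeating your if-checks, which is unnecessary. All they do is pad cds0 to 7 characters, which may not be what you want. I put together the following to demonstrate how you could refactor your code to make it more pythonic and DRY. I cannot guarantee that it works with your data set, but I hope it helps you understand how to do things differently.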
to_sort = []
# We can open two files using the with statement. This will also handle
# closing the files for us when we exit the block.
with open("count.txt", "r") as inp, open("output.txt", "w") as out:
    for each in inp:
        # Split at ';' so you won't have to worry about whether or not
        # the file is tab delimited.
        changed = each.split(";")
        # Get the value you want. This is called unpacking.
        # The value before '=' will always be 'ID', so we don't really care about it.
        # _ is generally used as a variable name when the value is discarded.
        _, value = changed[0].split("=")
        # 0-pad the desired value to 7 characters. Python string formatting
        # makes this very easy. This will replace the current value in the list.
        changed[0] = "ID={:0<7}".format(value)
        # Join the changed list with the original separator and
        # append it to the sort list.
        to_sort.append(";".join(changed))
    # Write the results to the file all at once. Your test data already
    # provided the newlines, so you can just write it out as it is.
    out.writelines(to_sort)
# Do what else you need to do. Maybe to_sort.sort()?
You will notice that this reduces your code to 8 lines while achieving exactly the same thing, without repetition, and it is easy to understand.
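One detail worth checking against your data is the direction of the padding. The format spec {:0<7} left-aligns the value and appends zeros on the right, whereas the insert(6, ...) calls in the question put zeros in front of the number. Both give cds0000 for cds0, but they differ for other IDs. The sketch below compares the two; the helper names pad_right and pad_number are hypothetical, and it assumes every value starts with the prefix cds followed by digits:

```python
def pad_right(value, width=7):
    """Append '0' characters on the right until value is width long."""
    return "{:0<{width}}".format(value, width=width)


def pad_number(value, prefix="cds", digits=4):
    """Zero-pad the numeric part after prefix on the left (assumes the prefix is present)."""
    number = value[len(prefix):]
    return prefix + number.zfill(digits)


print(pad_right("cds0"))    # cds0000
print(pad_number("cds0"))   # cds0000
print(pad_right("cds12"))   # cds1200
print(pad_number("cds12"))  # cds0012
```

If the goal is for the IDs to sort in numeric order, padding the number on the left is probably the variant you want.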
Please read PEP8 and the Zen of Python, then go through the official tutorial.