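I have a large BLAST file in tabular format. The number of target sequences was not limited, so it takes a long time to parse. I want to reduce the number of hits for each query sequence to the top 10. My Python is basic, but this is what I have so far: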
import sys
blastfile = open(sys.argv[1],"r")
column1list=[]
for line in blastfile:
    b = line.split()[0]
    column1list.append(b)
uniqcolumn1 = list(set(column1list))
counter = 0
for val in uniqcolumn1:
    #print val
    for line in blastfile:
        #print line
        while counter <= 10:
            if line.startswith(val):
                print line
                counter =+ 1
Here is an example line from the BLAST output file. The title of the query sequence is in the first column, in this case 'c8208_g1_i2':
c8208_g1_i2 gi|851252702|ref|WP_048131971.1| 79.30 797 165 0 4881 2491 1 797 0.0 1336 acetyl-CoA decarbonylase/synthase complex subunit alpha [Methanosaeta concilii]
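I think the first part of the code works fine, up to 'uniqcolumn1 = list(set(column1list))', but after that I cannot get it to print the first ten lines that start with each string in the list.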
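Answer 0 (score: 2)
The problem here seems to be that you are iterating over your file object twice. In Python, a file object works much like a pointer that advances as each line is read. Once it has reached the end of the file, there is nothing left to read unless you move the pointer back.
What you need to do is use the .seek function to move this pointer back to the beginning. For example, suppose you have file_to_read.txt and python_script.py.
file_to_read.txt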
Hello! My name is Bob and I can't think of anything to
put in this file so I'm blabbering on about nonsense
in hopes that you won't realise that this text is not
important but the code in the actual file, though I
think that you wouldn't mind reading this long file.
python_script.py
f = open("file_to_read.txt", "r")
for line in f: print line
for line in f: print line
If you run this code (and get no errors about directories), file_to_read.txt is printed only once. To fix this, simply add f.seek(0, 0) between the two reads. For example:
f = open("file_to_read.txt", "r")
for line in f: print line
f.seek(0, 0)
for line in f: print line
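The same effect can be checked without printing anything. The following is a minimal sketch, assuming Python 3 (the code in this thread is Python 2); the helper name count_lines_twice and the temporary file contents are made up for illustration. It iterates over one file object twice and reports how many lines each pass saw, with and without a rewind in between.

```python
import tempfile

def count_lines_twice(rewind):
    """Iterate over the same file object twice and return both line counts."""
    with tempfile.TemporaryFile(mode="w+") as f:
        f.write("first line\nsecond line\nthird line\n")
        f.seek(0)  # rewind after writing so the first pass starts at the top
        first_pass = sum(1 for _ in f)
        if rewind:
            f.seek(0)  # move the file position back to the beginning
        second_pass = sum(1 for _ in f)
    return first_pass, second_pass

print(count_lines_twice(rewind=False))  # (3, 0)
print(count_lines_twice(rewind=True))   # (3, 3)
```

Without the rewind, the second pass starts at the end of the file and sees no lines at all.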
Now, back to your context, you can see how this applies to your code:
import sys
# Here is your reading of file
blastfile = open(sys.argv[1],"r")
column1list = []
# Here is the first time you read the file
for line in blastfile:
    b = line.split()[0]
    column1list.append(b)
# Add a line to move back to the start before the
# next reading
blastfile.seek(0, 0)
uniqcolumn1 = list(set(column1list))
for val in uniqcolumn1:
    # Move the counter inside to refresh it after every iteration
    counter = 0
    # Here is the second time you read your file
    for line in blastfile:
        # Caution: this while loop is unchanged from the question and never
        # exits when the line does not start with val; see the edit below
        while counter <= 10:
            if line.startswith(val):
                print line
                counter += 1
    # Since you are going to read the file the next iteration,
    # .seek the file
    blastfile.seek(0, 0)
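Edit
Here is the second half of the code, fixed. You could do this:
for val in uniqcolumn1:
    # Move the counter in
    counter = 0
    # Move the while loop out: it now drives the reading and stops
    # after ten hits or at the end of the file
    line = blastfile.readline()
    while counter < 10 and line:
        if line.startswith(val):
            print line,
            counter += 1
        line = blastfile.readline()
    blastfile.seek(0, 0)
The advantage of this is that reading stops as soon as ten hits have been printed, so the whole file is not always read.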
Another way is to use this:
for val in uniqcolumn1:
    # Move counter in
    counter = 0
    # Remove while statement
    for line in blastfile:
        # Add additional condition to if statement
        # (counter < 10 prints exactly ten hits; counter <= 10 would print eleven)
        if line.startswith(val) and counter < 10:
            print line,
            counter += 1
        elif counter >= 10:
            break
    blastfile.seek(0, 0)
The advantage of this is that it looks simpler.
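One caveat with line.startswith(val) is that it is a prefix test, so a query ID such as c8208_g1_i2 would also match lines belonging to a longer ID such as c8208_g1_i21. Below is a minimal sketch of an exact comparison on the first column, assuming Python 3 and whitespace-delimited columns. The function name first_hits and the sample lines (including the ID c8208_g1_i21) are hypothetical and not taken from the original file.

```python
def first_hits(lines, query, limit=10):
    """Return up to `limit` lines whose first column equals `query` exactly."""
    hits = []
    for line in lines:
        if len(hits) >= limit:
            break  # stop reading as soon as enough hits are collected
        fields = line.split()
        if fields and fields[0] == query:
            hits.append(line)
    return hits

# Made-up sample data: the second line shares a prefix with the query ID.
sample = [
    "c8208_g1_i2 subjectA 79.30\n",
    "c8208_g1_i21 subjectB 65.10\n",
    "c8208_g1_i2 subjectC 70.00\n",
]
print(first_hits(sample, "c8208_g1_i2"))
```

Because any iterable of lines is accepted, an open file object can be passed in place of the list.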
Answer 1 (score: 1)
This single-pass version prints the first 10 lines for each title, in the order in which they appear in the file:
import sys
NUM_TO_PRINT = 10  # good practice - use names rather than raw numbers
blastfile = open(sys.argv[1], "r")
titles = {}  # an empty dictionary.
# This will map titles to counts of how many times a line with that title
# has been printed.
for line in blastfile:
    title = line.split()[0]  # assuming the title is space-delimited, and that the line is not empty
    num_printed = titles.get(title, 0)  # 0 is the default
    if num_printed < NUM_TO_PRINT:
        print line,  # comma because _line_ already has a newline - without the comma, you get a blank line after every printed line
        num_printed += 1
        titles[title] = num_printed  # save where we are
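For readers on Python 3, the same single-pass idea can be sketched as a generator. This is one possible rendering under stated assumptions, not the answer author's code: the name top_hits, the use of collections.Counter in place of the plain dictionary, and the sample lines are all illustrative choices. Blank lines are skipped here instead of being assumed absent.

```python
from collections import Counter

def top_hits(lines, limit=10):
    """Yield at most `limit` lines per query ID, preserving input order."""
    seen = Counter()  # query ID -> number of lines already yielded
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        query = fields[0]
        if seen[query] < limit:
            seen[query] += 1
            yield line

# Made-up sample data with two query IDs interleaved.
sample = ["q1 a\n", "q1 b\n", "q2 c\n", "q1 d\n", "q2 e\n"]
print(list(top_hits(sample, limit=2)))
# ['q1 a\n', 'q1 b\n', 'q2 c\n', 'q2 e\n']
```

To filter a real file, the generator could be fed an open file object and its output written with sys.stdout.writelines, which keeps memory use proportional to the number of distinct query IDs and not to the file size.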