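I'm doing some bioinformatics research and I'm new to Python. I wrote this code to interpret a file containing protein sequences. The file "bulk_sequences.txt" itself contains 71,423 lines of information. Every three lines describe one protein sequence, and the first of the three gives information that includes the year the protein was discovered (that is what the "/1945" is about). With a smaller sample of 1,000 lines it works fine, but with this large file it seems to take a very long time. Is there anything I can do to simplify this?
The code is meant to go through the file, sort it by year of discovery, and then assign all three lines of each protein sequence's data to an item in the array "sortedsqncs".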
import time
start = time.time()
file = open("bulk_sequences.txt", "r")
fileread = file.read()
bulksqncs = fileread.split("\n")
year = 1933
newarray = []
years = []
thirties = ["/1933","/1934","/1935","/1936","/1937","/1938","/1939","/1940","/1941","/1942"]## years[0]
forties = ["/1943","/1944","/1945","/1946","/1947","/1948","/1949","/1950","/1951","/1952"]## years[1]
fifties = ["/1953","/1954","/1955","/1956","/1957","/1958","/1959","/1960","/1961","/1962"]## years[2]
sixties = ["/1963","/1964","/1965","/1966","/1967","/1968","/1969","/1970","/1971","/1972"]## years[3]
seventies = ["/1973","/1974","/1975","/1976","/1977","/1978","/1979","/1980","/1981","/1982"]## years[4]
eighties = ["/1983","/1984","/1985","/1986","/1987","/1988","/1989","/1990","/1991","/1992"]## years[5]
nineties = ["/1993","/1994","/1995","/1996","/1997","/1998","/1999","/2000","/2001","/2002"]## years[6]
twothsnds = ["/2003","/2004","/2005","/2006","/2007","/2008","/2009","/2010","/2011","/2012"]## years[7]
years = [thirties,forties,fifties,sixties,seventies,eighties,nineties,twothsnds]
count = 0
sortedsqncs = []
for x in range(len(years)):
    for i in range(len(years[x])):
        for y in bulksqncs:
            if years[x][i] in y:
                for n in range(len(bulksqncs)):
                    if y in bulksqncs[n]:
                        sortedsqncs.append(bulksqncs[n:n+3])
                        count +=1
print len(sortedsqncs)
end = time.time()
print round((end - start),4)
Answer 0 (score: 5)
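tcaswell is basically right: you are looping over the file far too many times. Another inefficiency, at least from a readability and maintainability standpoint, is the predefined arrays of years. You should also never use range(len(seq)); there is almost always a better (more Pythonic) way. Finally, if you need a list of the lines in a file, use readlines().
A more pedestrian solution would be: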
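Following tcaswell's suggestion, write a function extract_year() that returns the year from an input line (a line of bulksqncs), or None if no year is found. You can use a regular expression, or, if you know where the year sits in the line, use that position.
Loop over the input and extract all the sequences, putting each one into a tuple (year, three lines of sequence) and appending the tuple to a list. This also allows input files in which non-sequence lines are interspersed with the sequences.
Sort the list of tuples by year.
Extract the sequences from the sorted list of tuples.
Example code - this gives you a Python list of the sorted sequences: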
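with open("bulk_sequences.txt") as infile:
    bulksqncs = infile.readlines()
sq_tuple = []
for idx, line in enumerate(bulksqncs):
    year = extract_year(line)  # None when the line contains no year
    if year:
        sq_tuple.append((year, bulksqncs[idx:idx+3]))
sq_tuple.sort()
# readlines() keeps each line's trailing newline, so join with ''
sortedsqncs = [''.join(item[1]) for item in sq_tuple]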
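One possible shape for the extract_year() helper, as a minimal sketch rather than the answerer's own code. It assumes the year is written as a slash followed by four digits (for example "/1945"), as described in the question. The regular expression and the sample header lines are made up for illustration, and the sketch is written for Python 3.

```python
import re

# Assumption: the year appears as "/" followed (optionally after spaces)
# by exactly four digits, e.g. "/1945".
YEAR_RE = re.compile(r"/\s*(\d{4})\b")

def extract_year(line):
    """Return the year found in line as an int, or None if there is none."""
    match = YEAR_RE.search(line)
    return int(match.group(1)) if match else None

# Hypothetical sample data: one header line, then two lines of sequence data.
lines = [
    ">protB /1950\n", "MKV\n", "LLA\n",
    ">protA /1945\n", "GGS\n", "TTP\n",
]

records = []
for idx, line in enumerate(lines):
    year = extract_year(line)
    if year is not None:
        # Keep the header line together with the two lines that follow it.
        records.append((year, lines[idx:idx + 3]))

records.sort(key=lambda rec: rec[0])  # compare on the year only
sortedsqncs = ["".join(rec[1]) for rec in records]
```

Sorting with an explicit key keeps the comparison on the year alone, so two records from the same year stay in file order instead of being ordered by their text.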
Answer 1 (score: 4)
The problem is that you are looping over your giant file an absurd number of times. You can do this in a single pass:
from itertools import izip_longest
#http://docs.python.org/2/library/itertools.html
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
# fold your list into a list of length 3 tuples
data = list(grouper(bulksqncs, 3))
# sort the list
# tuples will 'do the right thing' by default if the line starts with the year
data.sort()
If your year line does not start with the year, you need to use the key kwarg of sort:
data.sort(key=lambda x: extract_year(x[0]))
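For readers on Python 3, where izip_longest is named zip_longest, the same grouping idea might look like the sketch below. The extract_year helper and the sample lines are hypothetical; the helper assumes every header line contains a year written as "/YYYY".

```python
from itertools import zip_longest
import re

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') -> ABC DEF Gxx."""
    args = [iter(iterable)] * n  # n references to the same iterator
    return zip_longest(*args, fillvalue=fillvalue)

def extract_year(line):
    # Hypothetical helper: assumes the line contains "/YYYY".
    return int(re.search(r"/(\d{4})", line).group(1))

# Made-up input: each record is a header line plus two sequence lines.
bulksqncs = [">protB /1950", "MKV", "LLA", ">protA /1945", "GGS", "TTP"]

data = list(grouper(bulksqncs, 3))                   # list of 3-tuples
data.sort(key=lambda group: extract_year(group[0]))  # order by the header's year
```

This relies on the file being a strict repetition of three-line records; a stray blank or comment line would shift every group after it.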
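Answer 2 (score: 3)
The problem is that every time you find a year in a line, you loop through the file another time (for n in range(len(bulksqncs))), so by this estimate you end up with roughly 136 billion (= 71423 * (71423/3) * 80) iterations in total. You can reduce this to under 6 million (71423 * 80), which will still take some time but should be manageable.
A simple fix to the main loop is to use enumerate to get the line number, rather than having to go through the whole file from the start: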
for decade in decades:
    for year in decade:
        for n, line in enumerate(bulksqncs):
            if year in line:
                sortedsqncs.append(bulksqncs[n:n + 3])
                count += 1
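However, the time can be reduced further by putting the loop over years inside the loop that reads lines from the file. I would consider using a dictionary and reading the file one line at a time (rather than reading the whole thing at once with read()). When you find a year in the line, you can use next to get the following two lines along with the line you are currently on. The program then breaks out of the loop over years, avoiding unnecessary iterations (assuming a line cannot contain more than one year).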
years = ['/' + str(y) for y in range(1933, 2013)]
sequences = dict((year, []) for year in years)
with open("bulk_sequences.txt", "r") as bulk_sequences:
    for line in bulk_sequences:
        for year in years:
            if year in line:
                # next() pulls the two lines that follow the header line
                sequences[year].append((line,
                                        next(bulk_sequences),
                                        next(bulk_sequences)))
                break
The sorted list can then be obtained with [sequences[year] for year in years], or you can use an OrderedDict to keep the sequences in order.
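As a rough end-to-end illustration of the single-pass approach in Python 3, the sketch below uses an in-memory io.StringIO object as a stand-in for the real file. The sample records are invented, and the final flattening step is one possible way to get a single list of records rather than one list per year.

```python
import io

years = ["/" + str(y) for y in range(1933, 2013)]
sequences = {year: [] for year in years}

# Stand-in for open("bulk_sequences.txt"); the content is made up.
bulk_sequences = io.StringIO(
    ">protB /1950\nMKV\nLLA\n"
    ">protA /1945\nGGS\nTTP\n"
)

for line in bulk_sequences:
    for year in years:
        if year in line:
            # Header line plus the two lines that follow it in the file.
            sequences[year].append(
                (line, next(bulk_sequences), next(bulk_sequences))
            )
            break  # assume at most one year per line

# Flatten the per-year lists into one list ordered by year.
sortedsqncs = [record for year in years for record in sequences[year]]
```

Note that next() raises StopIteration if the file ends in the middle of a record, so a truncated input would need extra handling. On Python 3.7 and later a plain dict already preserves insertion order, so iterating over sequences directly also yields the years in ascending order here.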