I wrote a function that extracts a specific block of text from a large text file. A sample of the text looks like this:
ATP(1):C39(3) - A:TYR(58):CD2(67)
ATP(1):C39(3) - A:TYR(58):CE2(69)
ATP(1):C59(6) - A:ILE(61):CD1(100)
ATP(1):C59(6) - A:LYS(87):CE(344)
Hydrogen bonds:
Location of Donor | Sidechain/Backbone | Secondary Structure | Count
-------------------|--------------------|---------------------|-------
LIGAND | SIDECHAIN | OTHER | 1
RECEPTOR | BACKBONE | BETA | 1
Raw data:
ATP(1):O2A(9) - A:ILE(61):HN(93) - A:ILE(61):N(92)
Hydrophobic contacts (C-C):
Sidechain/Backbone | Secondary Structure | Count
--------------------|---------------------|-------
SIDECHAIN | OTHER | 2
SIDECHAIN | BETA | 23
Raw data:
ATP(1):C39(3) - A:TYR(58):CD2(67)
ATP(1):C39(3) - A:TYR(58):CE2(69)
ATP(1):C59(6) - A:ILE(61):CD1(100)
ATP(1):C59(6) - A:LYS(87):CE(344)
ATP(1):C4(23) - A:PHE(209):CD1(1562)
ATP(1):C4(23) - A:PHE(209):CE1(1564)
ATP(1):C2(26) - A:PHE(209):CD2(1563)
ATP(1):C6(28) - A:PHE(209):CB(1560)
ATP(1):C6(28) - A:PHE(209):CG(1561)
ATP(1):C6(28) - A:PHE(209):CD1(1562)
ATP(1):C6(28) - A:VAL(286):CG2(2266)
pi-pi stacking interactions:
ATP(1):C8(30) - A:LYS(87):CG(342)
ATP(1):C8(30) - A:GLU(159):CD(1066)
ATP(1):C8(30) - A:PHE(209):CE1(1564)
This is the function I wrote to extract the block:
from itertools import islice

def start_end_points(file_name):
    f = open(file_name)
    lines = f.readlines()
    for s, line in enumerate(lines):
        if "Hydrogen bonds:" in line:
            print(s)
    for e, line in enumerate(lines):
        if "pi-pi stacking interactions:" in line:
            print(e)
    print(islice(lines, s, e))

start_end_points("foo.txt")
Is there a way to write this code more efficiently? I want to use it as part of a web tool, so efficiency matters a lot.
Thanks.
Answer 0 (score: 4)
There is no reason to load the whole file into memory!
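def start_end_points(file_name):
    with open(file_name) as f:
        found = False
        for line in f:
            if found or ("Hydrogen bonds:" in line):
                found = True
                # Strip the trailing newline so print does not add a blank line.
                print(line.rstrip("\n"))
                # The end marker line itself is printed, then reading stops.
                if "pi-pi stacking interactions:" in line:
                    break

start_end_points("foo.txt")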
This way you keep only a single buffer in memory, process one line at a time, and stop reading the file as soon as you reach the pi-pi ... line.
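The same streaming idea can also be written as a generator, so the caller decides whether to print, collect, or further process the lines. This is only a sketch: the name iter_block and its start_marker and end_marker parameters are hypothetical and not part of the answer above, and unlike that code it leaves the end marker line out of the result.

```python
def iter_block(file_name, start_marker="Hydrogen bonds:",
               end_marker="pi-pi stacking interactions:"):
    """Yield lines from the start marker up to, but not including, the end marker.

    Hypothetical helper: the name and default markers are assumptions
    made for illustration.
    """
    with open(file_name) as f:
        found = False
        for line in f:
            if not found and start_marker in line:
                found = True
            if found:
                if end_marker in line:
                    return  # stop iterating; later lines are not processed
                yield line.rstrip("\n")
```

Iterating over iter_block("foo.txt") would then produce the block one line at a time, and wrapping the call in list() would collect it.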
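Answer 1 (score: 1)
You don't even have to keep all the lines in memory!
The with statement closes the file automatically, so it is both safe and convenient.
Note the two options in the code below: if efficiency is all that matters, choose the first.
I suggest you return the lines instead of printing them. You will probably want to use them again, and then you can print them without running the whole function a second time.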
def start_end_points(file_name):
    wanted_text = ""
    # Option 1 (memory-efficient): stream the file line by line.
    with open(file_name, "r") as f:
        found = False
        for line in f:
            if found:
                if "pi-pi stacking interactions:" in line:
                    break
                wanted_text += line
            elif "Hydrogen bonds:" in line:
                wanted_text += line
                found = True
    # Option 2 (uses more memory, but compact): read the whole file at once.
    # Use it instead of Option 1, not together with it.
    # with open(file_name, "r") as f:
    #     all_lines = f.read().split('\n')
    # numbers = [i for i, line in enumerate(all_lines)
    #            if "Hydrogen bonds:" in line or "pi-pi stacking interactions:" in line]
    # wanted_text = "\n".join(all_lines[numbers[0]:numbers[1]])
    # Finally, return the block instead of printing it.
    return wanted_text

data = start_end_points("foo.txt")
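As an alternative sketch of the same extraction, itertools.dropwhile and itertools.takewhile from the standard library can express "skip until the start marker, then take until the end marker" in a single pass. The function name extract_block and its parameters are hypothetical, introduced only for illustration. It returns a list of lines rather than one string.

```python
from itertools import dropwhile, takewhile

def extract_block(file_name, start_marker="Hydrogen bonds:",
                  end_marker="pi-pi stacking interactions:"):
    """Return the lines from the start marker up to, but not including, the end marker.

    Hypothetical helper: the name and default markers are assumptions.
    """
    with open(file_name) as f:
        # Lazily skip every line before the one containing the start marker.
        after_start = dropwhile(lambda line: start_marker not in line, f)
        # Lazily take lines until the one containing the end marker.
        block = takewhile(lambda line: end_marker not in line, after_start)
        # Consume the iterators while the file is still open.
        return [line.rstrip("\n") for line in block]
```

If the start marker never appears, this sketch returns an empty list. If the end marker never appears, it returns everything from the start marker to the end of the file.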
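Answer 2 (score: 1)
I think this is more efficient because you can iterate over f directly, which saves you the list conversion lines = f.readlines(). This code makes a single pass over the data (using two while loops) and stops at the end marker, whereas your code runs two for loops, each to the end of the file.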
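from pprint import pprint

def start_end_points(file_name):
    # The with block closes the file even if next(f) raises StopIteration
    # because one of the markers is missing.
    with open(file_name) as f:
        single_line = next(f)
        # Skip everything before the start marker.
        while "Hydrogen bonds:" not in single_line:
            single_line = next(f)
        result = []
        # Collect lines until the end marker is reached.
        while "pi-pi stacking interactions:" not in single_line:
            result.append(single_line.rstrip())
            single_line = next(f)
    pprint(result)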
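An important caveat: the file can still be modified after you have opened it. So the lines you read in the while loops may not be the ones you had in mind when you opened f.
Output, by the way: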
['Hydrogen bonds:',
' Location of Donor | Sidechain/Backbone | Secondary Structure | Count',
' -------------------|--------------------|---------------------|-------',
' LIGAND | SIDECHAIN | OTHER | 1',
'',
' RECEPTOR | BACKBONE | BETA | 1',
'',
'Raw data:',
' ATP(1):O2A(9) - A:ILE(61):HN(93) - A:ILE(61):N(92)',
'',
'Hydrophobic contacts (C-C):',
' Sidechain/Backbone | Secondary Structure | Count',
' --------------------|---------------------|-------',
' SIDECHAIN | OTHER | 2',
' SIDECHAIN | BETA | 23',
'',
'Raw data:',
' ATP(1):C39(3) - A:TYR(58):CD2(67)',
' ATP(1):C39(3) - A:TYR(58):CE2(69)',
' ATP(1):C59(6) - A:ILE(61):CD1(100)',
' ATP(1):C59(6) - A:LYS(87):CE(344)',
' ATP(1):C4(23) - A:PHE(209):CD1(1562)',
' ATP(1):C4(23) - A:PHE(209):CE1(1564)',
' ATP(1):C2(26) - A:PHE(209):CD2(1563)',
' ATP(1):C6(28) - A:PHE(209):CB(1560)',
' ATP(1):C6(28) - A:PHE(209):CG(1561)',
' ATP(1):C6(28) - A:PHE(209):CD1(1562)',
' ATP(1):C6(28) - A:VAL(286):CG2(2266)',
'']
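One thing to watch with the next(f) loops above: if either marker is missing from the file, next(f) raises StopIteration once the file runs out. A minimal sketch of one way to handle that is to pass a default value to next(). The function name block_or_none is hypothetical, and returning None when the start marker is missing is an assumption about what the caller wants.

```python
def block_or_none(file_name, start_marker="Hydrogen bonds:",
                  end_marker="pi-pi stacking interactions:"):
    """Return the block as a list of lines, or None if the start marker is absent.

    Hypothetical helper: the name, defaults, and None convention are assumptions.
    """
    with open(file_name) as f:
        # next(f, None) returns None at end of file instead of raising.
        line = next(f, None)
        while line is not None and start_marker not in line:
            line = next(f, None)
        if line is None:
            return None  # the start marker never appeared
        result = []
        # Collect until the end marker or the end of the file, whichever comes first.
        while line is not None and end_marker not in line:
            result.append(line.rstrip())
            line = next(f, None)
        return result
```

With this variant, a file that lacks the end marker yields everything from the start marker to the end of the file rather than an exception.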