I have a huge text file. It looks like this:
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000
.....
Now I want to write a Python script that finds lines such as > <Enzymologic: Ki nM 1> and > <Enzymologic: EC50/IC50 nM 1>, and prints the line that follows each of them in tab-separated format, like this:
> <Enzymologic: Ki nM 1> > <Enzymologic: EC50/IC50 nM 1>
257000 n/a
5000 1000
....
I tried the following code:
infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"
for i, line in enumerate(lines):
    if searchtxt in line and i+1 < len(lines):
        print lines[i+1]
But it does not work. Can anybody suggest some code to achieve this?
Thanks in advance.
Answer 0 (score: 1)
s = '''Enzymologic: Ki nM 1
257000
Enzymologic: IC50 nM 1
n/a
ITC: Delta_G0 kJ/mole 1
n/a
Enzymologic: Ki nM 1
5000
Enzymologic: IC50 nM 1
1000'''
from collections import defaultdict
lines = [x for x in s.splitlines() if x]
keys = lines[::2]
values = lines[1::2]
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print dict(result)
>>> {'ITC: Delta_G0 kJ/mole 1': ['n/a'], 'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000']}
Then format the output as needed.
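One possible way to do that formatting, as a minimal sketch in Python 3. The helper name to_tsv is made up for illustration, and itertools.zip_longest is the Python 3 name of what Python 2 calls izip_longest. It assumes the n-th value stored under each key belongs to the same record, which the file format itself does not guarantee.

```python
from itertools import zip_longest


def to_tsv(result, keys, fill=""):
    """Return tab-separated text: a header row, then one row per occurrence."""
    # One column of values per requested key (empty if the key never appeared).
    columns = [result.get(key, []) for key in keys]
    rows = ["\t".join(keys)]
    # zip_longest pads the shorter columns with `fill`.
    for row in zip_longest(*columns, fillvalue=fill):
        rows.append("\t".join(row))
    return "\n".join(rows)


# Same shape as the dictionary printed above.
result = {
    'ITC: Delta_G0 kJ/mole 1': ['n/a'],
    'Enzymologic: Ki nM 1': ['257000', '5000'],
    'Enzymologic: IC50 nM 1': ['n/a', '1000'],
}
wanted = ['Enzymologic: Ki nM 1', 'Enzymologic: IC50 nM 1']
print(to_tsv(result, wanted))
```

Passing the list of wanted keys explicitly also fixes the column order, which a plain dict does not guarantee in older Python versions.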
Answer 1 (score: 1)
I think your problem comes from writing if searchtxt in line, which tests the whole tuple at once, rather than testing if pattern in line for each pattern in searchtxt.
>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
...     for pattern in searchtxt:
...         if pattern in line and i+1 < len(lines):
...             dict_patterns[pattern].append(lines[i+1])
...
>>> dict_patterns
defaultdict(<type 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'],
'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})
That is what I would do. Using a dict lets you group the results by pattern (defaultdict is a convenient way to avoid having to initialize the lists yourself).
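To make the difference concrete, here is a small Python 3 sketch (the variable names tuple_test_failed and matches are only for illustration). Testing a tuple against a string with in is not a substring search at all; it raises a TypeError, while testing each pattern separately works.

```python
searchtxt = ("> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>")
line = "> <Enzymologic: Ki nM 1>\n"

# `tuple in str` is rejected: the left operand of a substring test must be a string.
try:
    searchtxt in line
    tuple_test_failed = False
except TypeError:
    tuple_test_failed = True

# Testing the patterns one by one gives the intended result.
matches = [pattern for pattern in searchtxt if pattern in line]
found_any = any(pattern in line for pattern in searchtxt)

print(tuple_test_failed, matches, found_any)
```

Use any(...) when you only need to know whether some pattern matched, and the list form when you need to know which one.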
Answer 2 (score: 0)
import itertools

def search(lines, terms):
    # one result column per term, starting with the term itself as header
    results = [[t] for t in terms]
    lines = iter(lines)
    for l in lines:
        for i,t in enumerate(terms):
            if t in l:
                # consume the following line as the value for this term
                results[i].append(next(lines).strip())
                break
    return results

def format(results):
    s = []
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        s.append("\t".join(row))
        s.append('\n')
    return ''.join(s)
Here is how to call the functions:
example = """> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000"""
def test():
    terms = ["> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"]
    lines = example.split('\n')
    result = search(lines, terms)
    print format(result)

>>> test()
> <Enzymologic: IC50 nM 1>	> <Enzymologic: Ki nM 1>
n/a	257000
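The example above separates the columns with a single tab. If you need fancier formatting (as in your example), the format function gets a little more complicated: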
import math

def format(results):
    # width of each column, rounded up to a multiple of 8 (one tab stop)
    maxcolwidth = [0] * len(results)
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        for i,col in enumerate(row):
            w = int(math.ceil(len(col)/8.0))*8
            maxcolwidth[i] = max(maxcolwidth[i], w)
    s = []
    for row in rows:
        for i,col in enumerate(row):
            s += col
            # pad with as many tabs as needed to reach the column width
            padding = maxcolwidth[i]-len(col)
            tabs = int(math.ceil(padding/8.0))
            s += '\t' * tabs
        s += '\n'
    return ''.join(s)
Answer 3 (score: 0)
You really have two different problems here:
import itertools
# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')
def iterate_on_couple(iterable):
    """
    Iterate on two elements, by two elements
    """
    iterable = iter(iterable)
    for x in iterable:
        yield x, next(iterable)

plain_lines = (l for l in pseudo_file if l.strip()) # ignore empty lines
results = {}
# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
    results.setdefault(name, []).append(value)
# now you got a dictionary with all values linked to a name
print results
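Now, this code assumes that your file is not corrupted and that you always have the same structure: a name line followed by a value line.
If that is not the case, you may need something more robust.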
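As an illustration of what "more robust" could mean, here is a Python 3 sketch of a pairing generator that tolerates a missing value line. It assumes that every name line starts with "> <", as in the sample from the question; the function name iter_fields and the missing parameter are hypothetical.

```python
def iter_fields(lines, missing=None):
    """Yield (name, value) pairs; value is `missing` when a name has no value line."""
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # ignore empty lines
        if line.startswith("> <"):        # assumption: this prefix marks a name line
            if name is not None:
                yield name, missing       # the previous name had no value
            name = line
        elif name is not None:
            yield name, line
            name = None
        # a value line with no pending name is silently skipped
    if name is not None:
        yield name, missing               # file ended right after a name line


damaged = """> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
> <ITC: Delta_G0 kJ/mole 1>
n/a
stray line
""".splitlines()

pairs = list(iter_fields(damaged))
print(pairs)
```

Whether a stray line should be skipped, logged, or treated as an error depends on how the file is produced, so that branch is the first thing to adapt.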
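Secondly, this stores all the values in memory, which may be a problem if you have a lot of values. In that case, you should look at a storage solution such as the shelve module or sqlite.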
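For the sqlite route, a minimal Python 3 sketch with the standard sqlite3 module could look like the following. The table name measurements and the helpers store_pairs and values_for are invented for the example, and it uses an in-memory database only so that it runs as-is; a file path would give on-disk storage.

```python
import sqlite3


def store_pairs(conn, pairs):
    """Insert an iterable of (name, value) pairs into a two-column table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurements (name TEXT, value TEXT)")
    # executemany consumes the iterable row by row, so `pairs` can be a generator.
    conn.executemany(
        "INSERT INTO measurements (name, value) VALUES (?, ?)", pairs)
    conn.commit()


def values_for(conn, name):
    """Return all values recorded for `name`, in insertion order."""
    cur = conn.execute(
        "SELECT value FROM measurements WHERE name = ? ORDER BY rowid", (name,))
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")  # replace with a file path to persist the data
store_pairs(conn, [
    ("> <Enzymologic: Ki nM 1>", "257000"),
    ("> <Enzymologic: IC50 nM 1>", "n/a"),
    ("> <Enzymologic: Ki nM 1>", "5000"),
])
print(values_for(conn, "> <Enzymologic: Ki nM 1>"))
```

Feeding store_pairs from a generator that reads the file line by line keeps memory use flat regardless of file size.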
import csv
def get(iterable, index, default):
    """
    Return an item from array or default if IndexError
    """
    try:
        return iterable[index]
    except IndexError:
        return default
names = results.keys() # get a list of all names
# now we write our tab separated file using the csv module
out = csv.writer(open('/tmp/test.csv', 'w'), delimiter='\t')
# first the header
out.writerow(names)
# get the size of the longest column
max_size = list(reversed(sorted(len(results[name]) for name in names)))[0]
# then write the lines one by one
for i in xrange(max_size):
    line = [get(results[name], i, "-") for name in names]
    out.writerow(line)
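Since I am writing the complete code for you, I deliberately used a few more advanced Python idioms, so that you have something to think about while using it.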
Answer 4 (score: 0)
import re
pseudo_file = """
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000"""
searchtxt = "nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>"
regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')
tu = tuple(regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt)
model = '%%-%ss %%s\n' % len(searchtxt[0])
regx_BBB = re.compile(('%s[ \t\r\n]+(.+)[ \t\r\n]+'
'.+?%s[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)') % tu)
print 'tu ==',tu
print 'model==',model
print 'regx_BBB.findall(pseudo_file)==\n',regx_BBB.findall(pseudo_file)
with open('woof.txt','w') as f:
    f.write(model % searchtxt)
    f.writelines(model % x for x in regx_BBB.findall(pseudo_file))
Result:
tu == ('nzymologic: .*?Ki.*? nM 1>', '<Enzymologic: .*?IC50.*? nM 1>')
model== %-20s %s
regx_BBB.findall(pseudo_file)==
[('257000', 'n/a'), ('5000', '1000')]
and the content of the file 'woof.txt' is:
> <Enzymologic: Ki nM 1> > <Enzymologic: IC50 nM 1>
257000 n/a
5000 1000
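To obtain regx_BBB, I first compute a tuple tu, because a line you want to catch in the file may differ from the string given in searchtxt. The tuple tu therefore introduces .*? into the strings of searchtxt, so that the regex regx_BBB is able to catch any line containing IC50, and not only the lines strictly equal to the elements of searchtxt.
Note that I put the strings "nzymologic: Ki nM 1>" and "<Enzymologic: IC50 nM 1>" in searchtxt, instead of the strings you used, to show that the regex is built in such a way that the result is still obtained.
The only condition is that there must be at least one character before the ':' in each string of searchtxt.
I assumed that, in the file, a line '> <Enzymologic: IC50 nM 1>' or '> <Enzymologic: EC50/IC50 nM 1>' always follows a line '> <Enzymologic: Ki nM 1>'.
But after reading the other answers, I think this is not obvious (it is a common problem with questions: they do not give enough information and precision).
If each line has to be caught independently, the following simpler regex regx_BBB can be used: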
regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')
li = [ regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt]
regx_BBB = re.compile('|'.join(li).join('()') + '[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')
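But formatting the output file will then be harder. It is hard for me to write a new complete code without knowing exactly what is needed.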
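For comparison, here is a hand-written Python 3 sketch of the "each line caught independently" idea. It is not the regex generated above: the pattern is written out by hand and assumes the only names of interest are the Ki, IC50 and EC50/IC50 ones from the question, each followed directly by its value line.

```python
import re

text = """> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000"""

# Group 1: the name between "> <" and ">"; group 2: the whole following line.
pattern = re.compile(
    r"^> <(Enzymologic: (?:Ki|(?:EC50/)?IC50) nM 1)>[ \t]*\r?\n([^\r\n]*)",
    re.MULTILINE)

pairs = pattern.findall(text)
for name, value in pairs:
    print(name + "\t" + value)
```

Each match is a (name, value) tuple, so the grouping into columns is left to a later step.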
Answer 5 (score: 0)
Probably the simplest way to find a string in a line and then print the next line is to use islice from itertools:
from itertools import islice
searchtxt = "<Enzymologic: IC50 nM 1>"
with open('file.txt','r') as itfile:
    for line in itfile:
        if searchtxt in line:
            print line
            # islice pulls the next line from the same file iterator
            print ''.join(islice(itfile,1))
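The same idea can be written with the built-in next() instead of islice. The following Python 3 sketch wraps it in a generator (lines_after is a hypothetical name) and uses io.StringIO in place of a real file so that it runs as-is.

```python
import io


def lines_after(lines, searchtxt):
    """Yield the line that follows each line containing searchtxt."""
    lines = iter(lines)
    for line in lines:
        if searchtxt in line:
            # next() advances the same iterator; None means the match was the last line.
            following = next(lines, None)
            if following is not None:
                yield following.rstrip("\n")


sample = io.StringIO(
    "> <Enzymologic: Ki nM 1>\n"
    "257000\n"
    "> <Enzymologic: IC50 nM 1>\n"
    "n/a\n"
    "> <Enzymologic: EC50/IC50 nM 1>\n"
    "1000\n"
)
found = list(lines_after(sample, "<Enzymologic: IC50 nM 1>"))
print(found)
```

As with the islice version, the consumed value line is never tested against searchtxt itself, and the EC50/IC50 line is not matched because the search string does not occur in it verbatim.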