Find multiple words and print the next line using Python

Question

I have a huge text file. It looks like this:

> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000

.....

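Now I want to write a Python script that finds lines such as "> <Enzymologic: Ki nM 1>" and "> <Enzymologic: EC50/IC50 nM 1>" and prints the line that follows each of them in tab-separated format, like this: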

> <Enzymologic: Ki nM 1>     > <Enzymologic: EC50/IC50 nM 1>
257000                       n/a
5000                         1000
....

I tried the following code:

infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"
for i, line in enumerate(lines): 
     if searchtxt in line and i+1 < len(lines):
         print lines[i+1]

But it doesn't work. Can anybody suggest some code to achieve this?

Thanks in advance.

Answer 1

s = '''Enzymologic: Ki nM 1

257000

Enzymologic: IC50 nM 1

n/a

ITC: Delta_G0 kJ/mole 1

n/a

Enzymologic: Ki nM 1

5000

Enzymologic: IC50 nM 1

1000'''
from collections import defaultdict

lines = [x for x in s.splitlines() if x]
keys = lines[::2]
values = lines[1::2]
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print dict(result)

>>> {'ITC: Delta_G0 kJ/mole 1': ['n/a'], 'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000']}

Then format the output as required.
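One possible way to do that formatting step is sketched below. It is written for Python 3 (the code in this thread is Python 2), and to_tsv, columns and fill are names made up for this illustration. It assumes the grouped dictionary from above and pads shorter columns with an empty string.

```python
from itertools import zip_longest

def to_tsv(result, columns, fill=""):
    """Render the chosen keys of a {name: [values]} mapping as tab-separated text."""
    rows = [columns]  # header row first
    # zip_longest lines the value lists up side by side, padding short ones
    rows.extend(zip_longest(*(result.get(c, []) for c in columns), fillvalue=fill))
    return "\n".join("\t".join(row) for row in rows)

# Same grouped data as printed above
result = {
    'ITC: Delta_G0 kJ/mole 1': ['n/a'],
    'Enzymologic: Ki nM 1': ['257000', '5000'],
    'Enzymologic: IC50 nM 1': ['n/a', '1000'],
}
print(to_tsv(result, ['Enzymologic: Ki nM 1', 'Enzymologic: IC50 nM 1']))
```

Passing the column names explicitly keeps the column order under your control instead of relying on dictionary ordering.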

Answer 2

I think your problem comes from testing "if searchtxt in line" instead of "if pattern in line" for each pattern in searchtxt. Here is what I would do:

>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
...     for pattern in searchtxt:
...         if pattern in line and i+1 < len(lines):
...             dict_patterns[pattern].append(lines[i+1])
...
>>> dict_patterns
defaultdict(<type 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'], 'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})

Using a dict lets you group the results by pattern (defaultdict is a convenient way to avoid having to initialize each entry yourself).
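To see why the "if searchtxt in line" test from the question fails, here is a small Python 3 sketch (the variable names raised and matches are only for illustration). With a tuple on the left-hand side, the "in" test against a string raises TypeError, while testing each pattern separately works.

```python
searchtxt = ("> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>")
line = "> <Enzymologic: Ki nM 1>\n"

try:
    searchtxt in line  # tuple on the left of "in <str>" is not allowed
    raised = False
except TypeError:
    raised = True

# Test every pattern on its own instead
matches = any(pattern in line for pattern in searchtxt)
print(raised, matches)
```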

Answer 3

import itertools

def search(lines, terms):
    results = [[t] for t in terms]
    lines = iter(lines)
    for l in lines:
        for i,t in enumerate(terms):
            if t in l:
                results[i].append(lines.next().strip())
                break
    return results

def format(results):
    s = []
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        s.append("\t".join(row))
        s.append('\n')
    return ''.join(s)

Here is how to call the functions:

example = """> <Enzymologic: Ki nM 1>
257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

def test():
    terms = ["> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"]
    lines = example.split('\n')
    result = search(lines, terms)
    print format(result)

>>> test()
> <Enzymologic: IC50 nM 1>   > <Enzymologic: Ki nM 1>
n/a 257000

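The example above separates the columns with a single tab. If you need fancier formatting (as in your example), the format function gets a bit more complicated: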

import math

def format(results):
    maxcolwidth = [0] * len(results)
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        for i,col in enumerate(row):
            w = int(math.ceil(len(col)/8.0))*8
            maxcolwidth[i] = max(maxcolwidth[i], w)

    s = []
    for row in rows:
        for i,col in enumerate(row):
            s += col
            padding = maxcolwidth[i]-len(col)
            tabs = int(math.ceil(padding/8.0))
            s += '\t' * tabs
        s += '\n'

    return ''.join(s)
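The tab arithmetic above assumes tab stops every 8 columns. If that assumption does not hold where the output is viewed, padding with spaces is an alternative. A minimal Python 3 sketch (format_aligned and gap are hypothetical names, and zip_longest replaces Python 2's izip_longest):

```python
from itertools import zip_longest

def format_aligned(results, gap=2):
    """Pad every column with spaces to its widest cell plus `gap`."""
    rows = list(zip_longest(*results, fillvalue=""))
    # width of each column = length of its longest cell
    widths = [max(len(cell) for cell in col) for col in zip(*rows)]
    out = []
    for row in rows:
        cells = [cell.ljust(w + gap) for cell, w in zip(row, widths)]
        out.append("".join(cells).rstrip())  # drop trailing padding only
    return "\n".join(out)

# Same shape as the value returned by search(): one list per term, term first
results = [
    ["> <Enzymologic: IC50 nM 1>", "n/a"],
    ["> <Enzymologic: Ki nM 1>", "257000", "5000"],
]
print(format_aligned(results))
```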

Answer 4

You really have two different problems:

Parsing the file and extracting data from it

import itertools

# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')

def iterate_on_couple(iterable):
  """
    Iterate on two elements, by two elements
  """
  iterable = iter(iterable)
  for x in iterable:
    yield x, next(iterable)

plain_lines = (l for l in pseudo_file  if l.strip()) # ignore empty lines

results = {}

# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
  results.setdefault(name, []).append(value)

# now you got a dictionary with all values linked to a name
print results

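Now, this code assumes that your file is not corrupted and that you always have the structure:

blank line
name
value

If not, you may need something more robust.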
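As a rough idea of what "more robust" could look like, here is a Python 3 sketch. It assumes, based only on the sample in the question, that a name line starts with "> <" and ends with ">"; parse_records is a hypothetical helper, not part of the code above. A name without a value is reported as None rather than shifting every later pair.

```python
def parse_records(lines):
    """Yield (name, value) pairs; value is None when a name has no value line."""
    pending = None  # name line still waiting for its value
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith("> <") and line.endswith(">"):
            if pending is not None:
                yield pending, None  # previous name had no value
            pending = line
        elif pending is not None:
            yield pending, line
            pending = None
        # a value line without a preceding name is skipped
    if pending is not None:
        yield pending, None

sample = ["> <A>", " 1", "", "> <B>", "> <C>", "3", "stray"]
print(list(parse_records(sample)))
```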

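Secondly, this stores all the values in memory, which could be a problem if you have a lot of values. In that case, you will need to look at a storage solution such as the shelve module or sqlite.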
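For the sqlite route, a minimal Python 3 sketch with the standard sqlite3 module could look like this. The table name measurements, its columns and the helper store_pairs are made up for illustration, and ":memory:" is used only so the example leaves no file behind; a real run would pass a file path.

```python
import sqlite3

def store_pairs(conn, pairs):
    """Insert (name, value) pairs one by one from any iterable."""
    conn.execute("CREATE TABLE IF NOT EXISTS measurements (name TEXT, value TEXT)")
    # executemany consumes the iterable lazily, so pairs can be a generator
    conn.executemany("INSERT INTO measurements (name, value) VALUES (?, ?)", pairs)
    conn.commit()

conn = sqlite3.connect(":memory:")
pairs = [
    ("> <Enzymologic: Ki nM 1>", "257000"),
    ("> <Enzymologic: IC50 nM 1>", "n/a"),
    ("> <Enzymologic: Ki nM 1>", "5000"),
]
store_pairs(conn, iter(pairs))

rows = conn.execute(
    "SELECT value FROM measurements WHERE name = ? ORDER BY rowid",
    ("> <Enzymologic: Ki nM 1>",),
).fetchall()
print(rows)
```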

Saving the results to a file

import csv

def get(iterable, index, default):
  """
    Return an item from array or default if IndexError
  """
  try:
      return iterable[index]
  except IndexError:
      return default

names = results.keys() # get a list of all names

# now we write our tab separated file using the csv module
out = csv.writer(open('/tmp/test.csv', 'w'), delimiter='\t')

# first the header
out.writerow(names)

# get the size of the longest column
max_size = list(reversed(sorted(len(results[name]) for name in names)))[0]

# then write the lines one by one
for i in xrange(max_size):
    line = [get(results[name], i, "-") for name in names]
    out.writerow(line)

Since I am writing the full code for you, I deliberately used some advanced Python idioms, so you will have some food for thought while using it.

Answer 5

import re

pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

searchtxt = "nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>"

regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')

tu = tuple(regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt)

model = '%%-%ss  %%s\n' % len(searchtxt[0])

regx_BBB = re.compile(('%s[ \t\r\n]+(.+)[ \t\r\n]+'
                       '.+?%s[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)') % tu)


print 'tu   ==',tu
print 'model==',model
print 'regx_BBB.findall(pseudo_file)==\n',regx_BBB.findall(pseudo_file)



with open('woof.txt','w') as f:
    f.write(model % searchtxt)
    f.writelines(model % x for x in regx_BBB.findall(pseudo_file))

Result

tu   == ('nzymologic: .*?Ki.*? nM 1>', '<Enzymologic: .*?IC50.*? nM 1>')
model== %-20s  %s

regx_BBB.findall(pseudo_file)==
[('257000', 'n/a'), ('5000', '1000')]

And the content of the file 'woof.txt' is:

> <Enzymologic: Ki nM 1>  > <Enzymologic: IC50 nM 1>
257000                    n/a
5000                      1000

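To obtain regx_BBB, I first compute a tuple tu, because you want to catch the line '> <Enzymologic: EC50/IC50 nM 1>' while searchtxt only contains '> <Enzymologic: IC50 nM 1>'.

So the tuple tu introduces .*? into the strings of searchtxt, so that the regex regx_BBB can catch lines that contain IC50, and not only lines that are strictly equal to the elements of searchtxt.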

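Note that I put the strings "nzymologic: Ki nM 1>" and "<Enzymologic: IC50 nM 1>" in searchtxt, instead of the ones you use, to show that the regexes are built in such a way that the result is still obtained.

The only condition is that there must be at least one character before the ':' in each string of searchtxt.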
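To make the effect of regx_AAA visible on its own, here is a short Python 3 sketch using raw strings. loosen is a name invented for this illustration; the regex and the replacement template are the ones from the code above.

```python
import re

regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')

def loosen(term):
    """Insert .*? around the first word after ': ' so extra text may surround it."""
    return regx_AAA.sub(r'\1.*?\2.*?\3', term)

pattern = loosen("<Enzymologic: IC50 nM 1>")
print(pattern)

# The loosened pattern accepts the EC50/IC50 variant but not the Ki line
print(bool(re.search(pattern, "> <Enzymologic: EC50/IC50 nM 1>")))
print(bool(re.search(pattern, "> <Enzymologic: Ki nM 1>")))
```

A string with nothing before the ':' is left unchanged by the substitution, which is where the condition above comes from.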

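Edit 1

I had assumed that, in the file, a line '> <Enzymologic: IC50 nM 1>' or '> <Enzymologic: EC50/IC50 nM 1>' always follows a line '> <Enzymologic: Ki nM 1>'.

But after reading the other answers, I think this is not obvious (a common problem with questions: they don't give enough information and precision).

If each line must be caught independently, the following simpler regex regx_BBB can be used: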

regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')

li = [ regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt]

regx_BBB = re.compile('|'.join(li).join('()') + '[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')

But formatting the output file will then be harder. I am reluctant to write complete new code without knowing exactly what is needed.
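One way the independent matches might be regrouped per search term is sketched below in Python 3. This is an assumption about what is wanted, not the finished answer: it gives each search term its own capturing group so that the matching term can be identified, and collect is a hypothetical helper.

```python
import re
from collections import defaultdict

regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')
searchtxt = ("nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>")
li = [regx_AAA.sub(r'\1.*?\2.*?\3', x) for x in searchtxt]

# One capturing group per search term, followed by one group for the value
regx = re.compile('(?:' + '|'.join('(%s)' % p for p in li) + ')'
                  + r'[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')

def collect(text):
    """Map each search term to the values found on the lines after its matches."""
    columns = defaultdict(list)
    value_group = len(searchtxt) + 1
    for m in regx.finditer(text):
        # exactly one of the term groups is set for every match
        idx = next(i for i in range(len(searchtxt)) if m.group(i + 1) is not None)
        columns[searchtxt[idx]].append(m.group(value_group))
    return dict(columns)

sample = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""
print(collect(sample))
```

The resulting lists can have different lengths when one kind of line occurs more often than another, so any column output would still need a padding rule.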

Answer 6

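The simplest way to find a string in a line and then print the next line is probably to use islice from itertools:

    from itertools import islice
    searchtxt = "<Enzymologic: IC50 nM 1>"
    with open('file.txt', 'r') as itfile:
        for line in itfile:
            if searchtxt in line:
                print line
                # islice takes the next line from the same file iterator
                print ''.join(islice(itfile, 1))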
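For Python 3, the same idea can be sketched with the built-in next() instead of islice. lines_after is a hypothetical helper, and io.StringIO stands in for the real file only so the example runs without one.

```python
import io

def lines_after(handle, searchtxt):
    """Return the stripped line that follows each line containing searchtxt."""
    found = []
    for line in handle:
        if searchtxt in line:
            # next() advances the same iterator; "" if the match is the last line
            following = next(handle, "")
            found.append(following.strip())
    return found

sample = io.StringIO(
    "> <Enzymologic: IC50 nM 1>\nn/a\n\n> <Enzymologic: Ki nM 1>\n5000\n"
)
print(lines_after(sample, "<Enzymologic: IC50 nM 1>"))
```

With a real file, the handle would come from open('file.txt') inside a with block, as in the snippet above.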
