How to extract paragraphs with similar headings in Python

Time: 2015-11-30 20:41:54

Tags: python text-extraction

I have several text files formatted as follows:

Technical :

localization lengths is observed at particular energies for an increasing binary backbone disorder. We comment on the possible biological relevance of sequence-dependent charge transfer in DNA

Work : 

We find that random and λ-DNA have localization lengths allowing for electron motion among a few dozen basepairs only.

Technical : 

We study the electronic properties of DNA by way of a tight-binding model applied to four particular DNA sequences. The charge transfer properties are presented in terms of localization lengths (crudely speaking, the length over which electrons travel.

Education :

Electronic, DNA sequence   

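Now I want to extract the paragraphs under the heading "Technical". With my code I can extract a specific paragraph between two headings, but not all the paragraphs that share the same heading.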

with open("aks.txt") as infile, open("fffm",'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "Technical":
            copy = True
        elif line.strip() == "Work":
            copy = False
        elif copy:
            outfile.write(line)

# Read the result back once the output file has been closed
with open("fffm") as fh:
    contents = fh.read()
print(len(contents))

1 Answer:

Answer 0: (score: 0)

Use regular expressions via the re module. See: https://docs.python.org/2/library/re.html

This code does what you need:

import re

the_text = """Technical :

localization lengths is observed at particular energies for an increasing binary backbone disorder. We comment on the possible biological relevance of sequence-dependent charge transfer in DNA

Work :

We find that random and λ-DNA have localization lengths allowing for electron motion among a few dozen basepairs only.

Technical :

We study the electronic properties of DNA by way of a tight-binding model applied to four particular DNA sequences. The charge transfer properties are presented in terms of localization lengths (crudely speaking, the length over which electrons travel.

Education :

Electronic, DNA sequence"""

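# Raw string so the backslashes reach the regex engine unchanged.
# Group 1 is the heading word, group 2 the first non-empty line after it.
for title, content in re.findall(r'(\w+) +?:\s+?(.+)', the_text):
    if title.lower() == "technical":
        print("Title: {}".format(title))
        print("Content: {}\n".format(content))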

Output:

Title: Technical
Content: localization lengths is observed at particular energies for an increasing binary backbone disorder. We comment on the possible biological relevance of sequence-dependent charge transfer in DNA

Title: Technical
Content: We study the electronic properties of DNA by way of a tight-binding model applied to four particular DNA sequences. The charge transfer properties are presented in terms of localization lengths (crudely speaking, the length over which electrons travel.
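
Note that `(.+)` stops at the first newline, so the regex above captures only the first line of each section. If a section can span several lines, a line-by-line scan closer to the code in the question may be easier to adapt. The following is only a sketch under assumptions: it treats any line consisting of a single word followed by a colon as a heading, and `extract_sections`, `HEADING` and the sample text are illustrative names and data, not part of the original post.

```python
import re

# Assumed heading shape: one word followed by a colon on its own line,
# e.g. "Technical :". Adjust the pattern if your headings differ.
HEADING = re.compile(r"^\s*(\w+)\s*:\s*$")


def extract_sections(lines, wanted):
    """Return the text under every heading named `wanted` (case-insensitive)."""
    sections = []   # one string per matching section
    current = None  # lines of the section being collected, or None
    for line in lines:
        match = HEADING.match(line)
        if match:
            # Any heading closes the section in progress.
            if current is not None:
                sections.append(" ".join(current))
            # Start collecting only if this heading is the wanted one.
            current = [] if match.group(1).lower() == wanted.lower() else None
        elif current is not None and line.strip():
            current.append(line.strip())
    # Flush the last section if the text ended while collecting.
    if current is not None:
        sections.append(" ".join(current))
    return sections


sample = """Technical :

first paragraph
continues here

Work :

skipped

Technical :

second paragraph
"""

for section in extract_sections(sample.splitlines(), "Technical"):
    print(section)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `extract_sections(infile, "Technical")` inside a `with open("aks.txt") as infile:` block. One caveat of the assumed heading pattern: a body line that is itself just a word and a colon would be mistaken for a heading.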