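Printing the first paragraph in Python

Time: 2016-01-02 22:18:57

Tags: python text paragraph

I have a book in a text file and I need to print the first paragraph of each section. I thought that if I found the text between \n\n and \n, I would have my answer. Here is my code, but it did not work. Can you tell me where I went wrong?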

lines = [line.rstrip('\n') for line in open('G:\\aa.txt')]

check = -1
first = 0
last = 0

for i in range(len(lines)):
    if lines[i] == "": 
            if lines[i+1]=="":
                check = 1
                first = i +2
    if i+2< len(lines):
        if lines[i+2] == "" and check == 1:
            last = i+2
while (first < last):
    print(lines[first])
    first = first + 1

I also found some code on Stack Overflow and tried it, but it only printed an empty array.

f = open("G:\\aa.txt").readlines()
flag=False
for line in f:
        if line.startswith('\n\n'):
            flag=False
        if flag:
            print(line)
        elif line.strip().endswith('\n'):
            flag=True

I have shared a sample part of the book below.

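THE LAY OF THE LAND

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.

II

WILD ANIMAL TEMPERAMENT & INDIVIDUALITY

What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

The output should look like this:

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.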

5 answers:

Answer 0: (score: 8)

If you want to group the sections, you can use itertools.groupby with empty lines as the delimiters:

from itertools import groupby
with open("in.txt") as f:
    for k, sec in groupby(f,key=lambda x: bool(x.strip())):
        if k:
            print(list(sec))
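
To see what this grouping produces without needing a file, here is a minimal sketch that runs the same groupby key over an in-memory list. The sample_lines list and the non_blank_groups helper are hypothetical names introduced only for illustration; they are not part of the answer.

```python
from itertools import groupby

# Hypothetical in-memory sample standing in for the lines of in.txt.
sample_lines = [
    "THE LAY OF THE LAND\n",
    "\n",
    "\n",
    "First paragraph.\n",
    "\n",
    "Second paragraph.\n",
]

def non_blank_groups(lines):
    """Return runs of consecutive non-blank lines, with newlines stripped."""
    groups = []
    # Consecutive lines sharing the same key (blank vs. non-blank) form one run.
    for has_text, run in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            groups.append([line.rstrip("\n") for line in run])
    return groups

print(non_blank_groups(sample_lines))
# [['THE LAY OF THE LAND'], ['First paragraph.'], ['Second paragraph.']]
```

The two blank lines after the title collapse into a single run, so the number of separators between blocks does not matter to this approach.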

Using a bit more itertools foo, we can get the sections using the uppercase titles as delimiters:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f,key=lambda x: x.isupper())
    for k, sec in grps:
        # if we hit a title line
        if k: 
            # pull all paragraphs
            v = next(grps)[1]
            # skip two empty lines after title
            next(v,""), next(v,"")

            # take all lines up to next empty line/second paragraph
            print(list(takewhile(lambda x: bool(x.strip()), v)))

Which would give you:

['There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.\n']
['What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.']

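Each section starts with an all-uppercase title, so once we hit one we know that two empty lines follow, then the first paragraph, and the pattern repeats.

Breaking it down into a version that uses loops: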

from itertools import groupby
def parse_sec(bk):
    with open(bk) as f:
        grps = groupby(f, key=lambda x: bool(x.isupper()))
        for k, sec in grps:
            if k:
                print("First paragraph from section titled :{}".format(next(sec).rstrip()))
                v = next(grps)[1]
                next(v, ""),next(v,"")
                for line in v:
                    if not line.strip():
                        break
                    print(line)

For your text:

In [11]: cat -E in.txt

THE LAY OF THE LAND$
$
$
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.$
$
Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.$
$
$
WILD ANIMAL TEMPERAMENT & INDIVIDUALITY$
$
$
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

The dollar signs mark the newlines, and the output is:

In [12]: parse_sec("in.txt")
First paragraph from section titled :THE LAY OF THE LAND
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

First paragraph from section titled :WILD ANIMAL TEMPERAMENT & INDIVIDUALITY
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

Answer 1: (score: 1)

There's always regex...

import re
with open("in.txt", "r") as fi:
    data = fi.read()
paras = re.findall(r"""
                   [IVXLCDM]+\n\n   # Line of Roman numeral characters
                   [^a-z]+\n\n      # Line without lower case characters
                   (.*?)\n          # First paragraph line
                   """, data, re.VERBOSE)
print("\n\n".join(paras))
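
As a quick check of what this pattern captures, the sketch below runs it against an in-memory string instead of a file. The sample text is hypothetical and assumes every section starts with a Roman numeral line, a blank line, an upper-case title and another blank line, which is what the pattern requires.

```python
import re

# Hypothetical sample laid out as the pattern expects: Roman numeral line,
# blank line, upper-case title, blank line, then the paragraphs.
sample = (
    "I\n\nTHE LAY OF THE LAND\n\n"
    "First paragraph of section one.\n\n"
    "Second paragraph of section one.\n\n"
    "II\n\nWILD ANIMAL TEMPERAMENT & INDIVIDUALITY\n\n"
    "First paragraph of section two.\n"
)

FIRST_PARAGRAPH = re.compile(r"""
    [IVXLCDM]+\n\n   # line of Roman numeral characters
    [^a-z]+\n\n      # line without lower-case characters
    (.*?)\n          # first paragraph line
    """, re.VERBOSE)

first_paragraphs = FIRST_PARAGRAPH.findall(sample)
print("\n\n".join(first_paragraphs))
```

Because "." does not match a newline by default, the capture group stops at the end of the first paragraph line, so this only works when each paragraph sits on a single line.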

Answer 2: (score: 0)

Go through the code you found line by line.

f = open("G:\\aa.txt").readlines()
flag=False
for line in f:
        if line.startswith('\n\n'):
            flag=True
        if flag:
            print(line)
        elif line.strip().endswith('\n'):
            flag=True

It seems that it never sets the flag variable to True.

If you could share a sample from your book, it would be more helpful for everyone.
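
The reason can be checked directly. Here is a minimal sketch, using io.StringIO in place of the real file and a made-up sample text, that evaluates both conditions for every line readlines() returns:

```python
import io

# Hypothetical sample; io.StringIO stands in for the opened file.
text = "TITLE\n\n\nFirst paragraph.\n\nSecond paragraph.\n"
lines = io.StringIO(text).readlines()

# Each element holds exactly one line, so "\n" can only appear at its end
# and no element can start with two newlines.
starts_with_two_newlines = [line.startswith("\n\n") for line in lines]
# strip() removes the trailing newline before endswith() looks for it.
stripped_ends_with_newline = [line.strip().endswith("\n") for line in lines]

print(lines)
print(any(starts_with_two_newlines))    # False
print(any(stripped_ends_with_newline))  # False
```

Since neither test can succeed on a single line, the branches that depend on them never run, and nothing is printed.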

Answer 3: (score: 0)

As long as there are no paragraphs in all caps, this should work:

[The code for this answer was not preserved in the source.]

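If you also want the last paragraph, keep track of the most recent line that contained lowercase characters. As soon as you find an all-uppercase line (I, II, etc.), which marks a new section, print that most recent line, since it is the last paragraph of the previous section.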
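
The idea described in this answer can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the answerer's code: the function name first_and_last_paragraphs and the sample lines are hypothetical, each paragraph is assumed to occupy one line, and heading lines (section numbers and titles) are assumed to contain no lowercase letters.

```python
def first_and_last_paragraphs(lines):
    """Return (first, last) paragraph pairs, one pair per section."""
    sections = []
    first = last = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no information here
        if line.isupper():
            # A heading starts a new section: flush the previous one, if any.
            if first is not None:
                sections.append((first, last))
                first = last = None
        else:
            if first is None:
                first = line  # first body line after the headings
            last = line       # most recent body line seen so far
    if first is not None:
        sections.append((first, last))  # flush the final section
    return sections

# Hypothetical sample in the layout of the question's book excerpt.
sample = [
    "I", "", "THE LAY OF THE LAND", "",
    "Opening paragraph.", "", "Closing paragraph.", "",
    "II", "", "WILD ANIMAL TEMPERAMENT & INDIVIDUALITY", "",
    "Only paragraph.",
]
for first, last in first_and_last_paragraphs(sample):
    print(first)
    print(last)
```

When a section has a single paragraph, the same line is reported as both first and last.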

Answer 4: (score: 0)

TXR solution

$ txr firstpar.txr data
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

Code in firstpar.txr:

@(repeat)
@num

@title

@firstpar
@  (require (and (< (length num) 5)
                 [some title chr-isupper]
                 (not [some title chr-islower])))
@  (do (put-line firstpar))
@(end)

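Basically, we are searching the input for matches of a three-element, multi-line pattern that binds the num, title and firstpar variables. As written, this pattern can match in the wrong places, so a require assertion adds some constraining heuristics: the section number must be a short line, and the title line must contain some upper-case letters and no lower-case ones. That expression is written in TXR Lisp.

If we get a match that satisfies this constraint, we output the string captured in the firstpar variable.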
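
For readers who do not use TXR, the same heuristics can be approximated in Python. This is only a rough sketch, not a translation of TXR's matching semantics: the first_paragraphs function and the sample lines are hypothetical, and it slides a five-line window (number, blank, title, blank, paragraph) over the input.

```python
def first_paragraphs(lines):
    """Yield the line that follows each (number, blank, title, blank) run."""
    lines = [line.rstrip("\n") for line in lines]
    # Slide a five-line window: num, blank, title, blank, firstpar.
    for i in range(len(lines) - 4):
        num, gap1, title, gap2, firstpar = lines[i:i + 5]
        if gap1 or gap2:
            continue  # the pattern requires blank lines in these positions
        # Same constraints as the require clause: short section number,
        # title with some upper-case letters and no lower-case ones.
        if (len(num) < 5
                and any(ch.isupper() for ch in title)
                and not any(ch.islower() for ch in title)):
            yield firstpar

# Hypothetical sample with Roman numeral section numbers.
sample = [
    "I", "", "THE LAY OF THE LAND", "",
    "First paragraph of section one.", "",
    "Second paragraph of section one.", "",
    "II", "", "WILD ANIMAL TEMPERAMENT & INDIVIDUALITY", "",
    "First paragraph of section two.",
]
print(list(first_paragraphs(sample)))
```

As with the TXR version, the constraints are heuristics, so unusual input (for example a short body line followed by an all-caps line) could still produce a false match.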