Question

Update

I talked this over with a colleague, and we came up with the following solution...

#!/usr/bin/env python

# Open both files in one "with" statement so they are closed automatically.
with open("small-output.txt", "rt") as in_file, open("output-new.txt", "wt") as txtfile:
    words = []
    hit = False  # True while we are between "Title" and "Description"
    for each in in_file:
        line = each.strip()
        if line == "Description" and hit:
            # End of a block: write the collected words as one line.
            hit = False
            txtfile.write(" ".join(words) + "\n")
            words = []
        elif hit:
            words.append(line)
        elif line == "Title":
            hit = True

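This isn't a perfect or elegant solution, because writing the .csv file directly runs into trouble with all the inline commas. So what I ended up doing was using the script above to write out a text file and then importing that into a .csv.

Ideally, the output would look like

This is the title of the thing, also foo.

With that in mind, can anyone improve the code so that each captured sentence becomes one row in a spreadsheet? Or does anyone have a more elegant solution using Python 2.7?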
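One way to sidestep the inline-comma problem might be to let Python's csv module do the quoting, so that each captured sentence lands in a single cell. The following is only a minimal sketch: the function name titles_to_csv and the in-memory buffer are illustrative assumptions, and the StringIO import fallback is there only so the snippet runs under both Python 2.7 and Python 3.

```python
import csv

try:
    from StringIO import StringIO  # Python 2.7
except ImportError:
    from io import StringIO  # Python 3


def titles_to_csv(titles, fileobj):
    """Write each title as its own one-cell row (hypothetical helper)."""
    writer = csv.writer(fileobj)
    for title in titles:
        # csv.writer quotes any field that contains a comma.
        writer.writerow([title])


buf = StringIO()
titles_to_csv(["This is the title of the thing, also foo."], buf)
print(buf.getvalue())
```

When writing to a real file, the csv documentation recommends opening it in binary mode under Python 2.7, for example open("output.csv", "wb"), and with newline="" under Python 3; the file name output.csv is just a placeholder.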

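End update:

I've been browsing StackExchange all morning, and although I've seen many similar solutions, I haven't found one that quite fits my requirements. I'm trying to write a script that parses a text file, copies the multiple lines of text between two delimiters, and then pastes each set of strings into a .csv file. The lines in the text file look like this: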

a string
......
a string
another string, a string
Title
This
is
the
title
of
the
thing,
also foo.
Description
a string
..........
another string

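Specifically, I want to capture everything between 'Title' and 'Description' and then write it to a .csv file.

This started out as a very large PDF (10,000+ pages) that was exported to a text file with pdfminer, and the delimiters occur many times; so ideally the output would be many rows of cells, each containing a sentence or so.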
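Because the input is large and the delimiters repeat many times, it may help to stream the file line by line instead of reading it all at once. Here is a rough sketch under that assumption; iter_titles and its parameter names are made up for illustration, and it assumes each delimiter sits alone on its own line, as in the sample above.

```python
def iter_titles(lines, start="Title", end="Description"):
    """Yield one joined string per start...end block (hypothetical helper)."""
    words = None  # None means we are outside a block
    for raw in lines:
        line = raw.strip()
        if line == start:
            words = []  # a new block begins
        elif line == end and words is not None:
            yield " ".join(words)
            words = None
        elif words is not None and line:
            words.append(line)


sample = [
    "a string", "Title", "This", "is", "the", "title", "of", "the",
    "thing, ", "also foo.", "Description", "a string",
]
for title in iter_titles(sample):
    print(title)
```

Since iterating over a file object yields its lines, an open file can be passed in place of the list, which keeps only one block in memory at a time.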

So far I've been using Python 2.7 and regular expressions, but I'm open to other *nix approaches, e.g. awk, sed, grep, etc.
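For the regular-expression route, something along these lines could work. It is a sketch, not a tested solution for the real data: extract_blocks and BLOCK_RE are illustrative names, the pattern assumes each delimiter is alone on its line, and the whole text has to fit in memory.

```python
import re

# re.M lets ^ and $ match at line boundaries; re.S lets "." span newlines.
BLOCK_RE = re.compile(
    r"^[ \t]*Title[ \t]*$(.*?)^[ \t]*Description[ \t]*$",
    re.S | re.M,
)


def extract_blocks(text):
    """Return one whitespace-normalized string per Title...Description block."""
    return [" ".join(block.split()) for block in BLOCK_RE.findall(text)]


sample_text = "a string\nTitle\nThis\nis\nthe\ntitle\nDescription\nanother string\n"
print(extract_blocks(sample_text))  # ['This is the title']
```

The non-greedy (.*?) stops at the first following Description line, so consecutive blocks are returned separately rather than merged.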

Here are a few non-working snippets I've tried...

#!/usr/bin/env python
import re, csv

text_file = open('test_file.txt')

with open(text_file, 'wb') as fout:
    for result in re.findall('Description(.*?)Family', enb_document.read(), re.S):
             # fout.write(result)
fout.close()

def extractData():
    filename = ("test_file.txt")
    infile = open(filename,'r')
    startdelim = 'Description'
    enddelim = 'Family'

    for x in infile.readlines():
        x = x.strip()
        if x.startswith(startdelim):
            print >> sequence
        else:
            sequence = x
            if delim1.startswith(enddelim):



    infile.close()

extractData()

Any ideas? Thanks in advance!

Answer 1

sed -n '/^Title$/,/^Description$/{//d;p}' file

Output

This
is
the
title
of
the
thing.
