Python: creating a list of separate texts from a text file that contains multiple texts

Time: 2018-08-30 16:06:05

Tags: python list text

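I have a .txt file that contains 4 texts, and I want to build a list in which each of the four texts is a separate element, so the list ends up with 4 objects. The code should do something like this: read the file line by line, appending each line to the current document, and start a new document whenever a header line such as "1 doc of x" is reached. I tried the following, but it does not produce what I want: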

with open('testfile.txt') as f:
    myList = f.readlines()

# Strips each line, but still yields one list element per line, not per document.
myList = [x.strip() for x in myList]

testfile.txt

1 doc of 4

Hello World. 
This is another question


2 doc of 4

This is a new text file. 
Not much in it.

3 doc of 4

This is the third text. 
It contains separate info.

4 doc of 4

The final text. 
A short one.

Expected output for myList:

myList = ['Hello World. This is another question',
          'This is a new text file. Not much in it.',
          'This is the third text. It contains separate info.',
          'The final text. A short one.']

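1 Answer:

Answer 0 (score: 0)

Sure.

Something like this should work. Note, however, that it will crash if the file does not start with a header line: myList[-1] is then evaluated while myList is still empty, which raises an IndexError.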

import re

# This will hold each document as a list of lines.
# To begin with, there are no documents.
myList = []

# Define a regular expression to match header lines.
header_line_re = re.compile(r'\d+ doc of \d+')

with open('testfile.txt') as f:
    for line in f:  # For each line...
        line = line.strip()  # Remove leading and trailing whitespace
        if header_line_re.match(line):  # If the line matches the header line regular expression...
            myList.append([])  # Start a new group within `myList`,
            continue  # then skip processing the line further.
        if line:  # If the line is not empty, simply add it to the last group.
            myList[-1].append(line)

# Recompose the lines back to strings (separated by spaces, not newlines).
myList = [' '.join(doc) for doc in myList]

print(myList)

The output (reformatted here for readability) is:

[
    "Hello World. This is another question",
    "This is a new text file. Not much in it.",
    "This is the third text. It contains separate info.",
    "The final text. A short one.",
]
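
If the file might not begin with a header line, one way to avoid the crash is to open an implicit first document for any text that appears before the first header. The following is only a sketch under that assumption: the function name split_documents, the constant HEADER_RE, and the sample text are made up for illustration and are not part of the original answer. It also uses fullmatch rather than match, so a body line that merely starts with something like "1 doc of 4" is not treated as a header.

```python
import re

# Header lines look like "1 doc of 4"; fullmatch requires the whole line to match.
HEADER_RE = re.compile(r'\d+ doc of \d+')


def split_documents(lines):
    """Group lines into documents separated by header lines.

    `lines` is any iterable of strings (an open file works too).
    Text found before the first header becomes its own document
    instead of raising IndexError.
    """
    docs = []
    for raw in lines:
        line = raw.strip()
        if HEADER_RE.fullmatch(line):
            docs.append([])  # A header starts a new document.
            continue
        if not line:
            continue  # Skip blank lines.
        if not docs:
            docs.append([])  # No header seen yet: open an implicit document.
        docs[-1].append(line)
    # Join each document's lines with spaces, as in the answer above.
    return [' '.join(doc) for doc in docs]


# Hypothetical sample that does NOT start with a header line.
sample = """Preamble without a header.

1 doc of 2

Hello World.
This is another question

2 doc of 2

The final text.
A short one.
"""

print(split_documents(sample.splitlines()))
# ['Preamble without a header.', 'Hello World. This is another question', 'The final text. A short one.']
```

To use it on the file from the question, pass the open file object directly: with open('testfile.txt') as f: myList = split_documents(f). Whether leading text should be kept as its own document or discarded depends on the data; dropping it would only require replacing the implicit docs.append([]) with a continue.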