Question

I have a text file in the following format:

1. AUTHOR1

(blank line, with a carriage return)

Citation1

2. AUTHOR2

(blank line, with a carriage return)

Citation2

(...)

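That is, some lines in this file start with an integer, followed by a period, a space, and text giving the author's name; each of these lines is followed by a blank line (containing a carriage return), and then by a line of text that starts with an alphabetic character (the article or book citation).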

What I want is to read this file into a Python list, joining each author's name with its citation, so that the list looks like this:

['AUTHOR1 Citation1', 'AUTHOR2 Citation2', '...']

It looks like a simple programming problem, but I couldn't find a solution. My attempt was as follows:

articles = []
with open("sample.txt", "rb") as infile:
    while True:
        text = infile.readline()
        if not text: break
        authors = ""
        citation = ""
        if text == '\n': continue
        if text[0].isdigit():
           authors = text.strip('\n')
        else:
           citation = text.strip('\n')
        articles.append(authors+' '+citation)

But the articles list stores the authors and the citations as separate elements!

Thanks in advance for any help with this tricky problem... :-(

Answer 1

Assuming the structure of your input file:

"""
1. AUTHOR1

Citation1
2. AUTHOR2

Citation2
"""

does not change, I would use readlines() and slicing:

with open('sample.txt', 'r') as infile:
    lines = infile.readlines()
    # Remove blank lines; what remains alternates author, citation.
    lines = [line.strip() for line in lines if line.strip()]
    # "1. AUTHOR1" -> "AUTHOR1": keep the text after the numbering.
    auth = [line.split('.', 1)[-1].strip() for line in lines[0::2]]
    cita = lines[1::2]
    result = ['%s %s' % (a, c) for a, c in zip(auth, cita)]
    print(result)

# ['AUTHOR1 Citation1', 'AUTHOR2 Citation2']

Answer 2

You can use readline to skip the blank line. Here is your loop body:

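# Split once so multi-word author names survive: "1. A B" -> "A B"
author = infile.readline().strip().split(' ', 1)[1]
infile.readline()  # skip the blank line after the author
citation = infile.readline().strip()  # drop the trailing newline
articles.append("{} {}".format(author, citation))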
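The loop body above still needs a loop around it and a way to stop at the end of the file. Here is a minimal sketch of one way to wrap it, assuming every author line has a space after the numbering and is followed by exactly one blank line. The helper name read_articles and the in-memory sample are my own additions for illustration, not part of the question.

```python
import io

def read_articles(infile):
    """Hypothetical helper: pair each author line with the citation after it."""
    articles = []
    while True:
        author_line = infile.readline()
        if not author_line:          # empty string means end of file
            break
        if not author_line.strip():  # tolerate blank lines between entries
            continue
        # "1. AUTHOR1" -> "AUTHOR1"; assumes a space follows the numbering
        author = author_line.strip().split(' ', 1)[1]
        infile.readline()            # skip the blank line after the author
        citation = infile.readline().strip()
        articles.append("{} {}".format(author, citation))
    return articles

# In-memory stand-in for sample.txt
sample = io.StringIO("1. AUTHOR1\n\nCitation1\n2. AUTHOR2\n\nCitation2\n")
print(read_articles(sample))
```

With a real file you would pass the object returned by open("sample.txt") instead of the StringIO.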

Answer 3

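The problem is that on each loop iteration you get only one of the two, the author or the citation, never both. So when you append, you only have one element.

One way to fix this is to read both on each loop iteration.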
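As a rough sketch of that idea (not the only way to do it): drop the blank lines first, then pull two lines per iteration from the same iterator. This assumes author and citation lines strictly alternate once blank lines are removed. The name pair_lines and the in-memory sample are hypothetical, added only for illustration.

```python
import io

def pair_lines(infile):
    """Hypothetical helper: consume author and citation together per iteration."""
    # Generator of stripped, non-blank lines; authors and citations alternate.
    content = (line.strip() for line in infile if line.strip())
    articles = []
    # zip(content, content) takes two consecutive items from the same iterator
    for author_line, citation in zip(content, content):
        author = author_line.split(None, 1)[-1]  # drop the leading "1." numbering
        articles.append(author + ' ' + citation)
    return articles

# In-memory stand-in for sample.txt
sample = io.StringIO("1. AUTHOR1\n\nCitation1\n\n2. AUTHOR2\n\nCitation2\n")
print(pair_lines(sample))
```

Because both values are in hand before the append, each list element ends up holding the author and the citation together.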

Answer 4

This should work:

articles = []
with open("sample.txt") as infile:
    for raw_line in infile:
        line = raw_line.strip()
        if not line:
            continue
        if line[0].isdigit():
            author = line.split(None, 1)[-1]
        else:
            articles.append('{} {}'.format(author, line))

Answer 5

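The slicing-based solution is neat, but if even a single blank line is out of place, it throws the whole thing off. Here is a solution using a regular expression that should work even if the structure varies: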

import re

# Group 1: author text after the "N. " prefix; group 2: the citation line.
pattern = re.compile(r'^\d+\.[ \t]*(.*)$\s*^(\w.*)$', re.MULTILINE)
with open("sample.txt") as infile:
    text = infile.read()  # findall needs one string, not a list of lines
matches = pattern.findall(text)
formatted_output = [author + ' ' + citation for author, citation in matches]
