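Splicing a text file when a marker is reached

Date: 2014-09-16 14:49:29

Tags: python

I have a master file that is a collection of newspaper articles, and I need to put each article into its own file. Thankfully, the last line of every article is a copyright notice, so I wrote the following to try to automate what I want: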

def splicearticles():
    countart = 1
    new_file = "article1.txt"
    with open("newspaperarticles.txt", "r") as my_file:
        with open("temporaryarticle.txt", "a+") as my_temporary:
            for line in my_file:
                if line.strip() != "Reserved by Author":
                    currentline = line.strip() + "\n"
                    my_temporary.write(currentline)
                else:
                    with open(new_file, "w") as my_final:
                        my_final.write(my_temporary.read())
                    countart += 1
                    new_file = "article" + str(countart) + ".txt"
                    my_temporary.truncate(0)

The problem seems to be with my_final.write(my_temporary.read()), since every other part of the code runs as expected. Can anyone tell me what I'm doing wrong?

2 Answers:

Answer 0 (score: 0)

                with open(new_file, "w") as my_final:
                    my_final.write(my_temporary.read())

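When my_temporary.read() runs here, the file position is at the end of the temporary file, so the read call returns nothing. Try moving the file position back to the start of the file first.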

                with open(new_file, "w") as my_final:
                    my_temporary.seek(0)
                    my_final.write(my_temporary.read())
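
The effect of the file position can be reproduced in isolation. The following is only a minimal sketch: the helper name read_after_write and the scratch file created with tempfile are assumptions made for illustration, not part of the original code.

```python
import os
import tempfile


def read_after_write(rewind):
    """Write one line to a file opened in "a+" mode, then read it back.

    If rewind is True, the file position is moved to the start before reading.
    """
    # Hypothetical scratch file, used only for this demonstration.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "a+") as tmp:
            tmp.write("first line\n")
            if rewind:
                tmp.seek(0)  # go back to the beginning of the file
            return tmp.read()
    finally:
        os.remove(path)


print(repr(read_after_write(rewind=False)))
print(repr(read_after_write(rewind=True)))
```

Without the seek, the read starts at the end of the file and comes back empty; with it, the line that was just written is returned.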

Alternatively, don't use a temporary file object at all. You can just as easily store the lines in a list.

def splicearticles():
    countart = 1
    new_file = "article1.txt"
    with open("newspaperarticles.txt", "r") as my_file:
        temp = []
        for line in my_file:
            if line.strip() != "Reserved by Author":
                currentline = line.strip() + "\n"
                temp.append(currentline)
            else:
                with open(new_file, "w") as my_final:
                    my_final.write("".join(temp))
                countart += 1
                new_file = "article" + str(countart) + ".txt"
                temp = []
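
One thing to watch with the list-based approach: an article that is not followed by the marker line (for example, the last one in the file) is never written out. The sketch below is one possible variation that also emits such a trailing article. The names split_articles and write_articles, and the choice to skip a trailing chunk that contains only blank lines, are assumptions for illustration.

```python
def split_articles(lines, marker="Reserved by Author"):
    """Yield the text of each article found in an iterable of lines."""
    current = []
    for line in lines:
        if line.strip() == marker:
            # The marker closes the current article.
            yield "".join(current)
            current = []
        else:
            current.append(line.strip() + "\n")
    # Assumption: trailing text without a closing marker is still an article,
    # unless it consists of blank lines only.
    if any(item.strip() for item in current):
        yield "".join(current)


def write_articles(source, pattern="article{}.txt"):
    """Write each article from source to its own numbered file."""
    with open(source, "r") as src:
        for number, text in enumerate(split_articles(src), 1):
            with open(pattern.format(number), "w") as out:
                out.write(text)
```

Calling write_articles("newspaperarticles.txt") would then produce article1.txt, article2.txt, and so on, using the same file names as the question.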

Answer 1 (score: 0)

This may be easier with a regular expression.

Suppose you have a file containing three articles:

Article 1
Blah blah blah blah did blah

Reserved by Author

Article 2
This article goes on and on
end of that article
Reserved by Author

Now we have article 3
blah blah
And it goes on

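You can memory-map the file so that it can be searched like a string without loading it all into memory. That lets you run a regular expression over the file's contents and split on the "Reserved by Author" delimiter: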

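import re
import mmap

fn = '/tmp/articles.txt'

# Open in binary mode: on Python 3 a memory map is bytes-like,
# so the pattern and the output files must use bytes as well.
with open(fn, 'rb') as articles:
    mf = mmap.mmap(articles.fileno(), 0, access=mmap.ACCESS_READ)
    # Lazily match everything up to the next marker line or the end of the file.
    chunks = re.finditer(br'(.+?)(?:Reserved by Author\s*\n|\Z)', mf, re.S | re.M)
    for i, block in enumerate(chunks, 1):
        text = block.group(1)
        with open('/tmp/article {}.txt'.format(i), 'wb') as fout:
            fout.write(text)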

With this simple example, we create 3 new files:

$ cat "article 1.txt"
Article 1
Blah blah blah blah did blah

$ cat "article 2.txt"
Article 2
This article goes on and on
end of that article

$ cat "article 3.txt"
Now we have article 3
blah blah
And it goes on
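
If the file comfortably fits in memory, a similar split can be sketched on an ordinary string with re.split, without mmap. This is only an illustration under that assumption: the function name split_on_marker is hypothetical, and unlike the version above it trims blank lines at the start and end of each article.

```python
import re


def split_on_marker(text, marker="Reserved by Author"):
    """Return the articles in text, split on lines holding only the marker."""
    # Match the marker on its own line, allowing surrounding spaces or tabs.
    pattern = r"^[ \t]*" + re.escape(marker) + r"[ \t]*\n?"
    parts = re.split(pattern, text, flags=re.M)
    # Drop empty chunks and normalise each article to end with one newline.
    return [part.strip("\n") + "\n" for part in parts if part.strip()]


sample = (
    "Article 1\nBlah blah\n\nReserved by Author\n\n"
    "Article 2\nmore text\nReserved by Author\n\n"
    "Now we have article 3\n"
)
for number, article in enumerate(split_on_marker(sample), 1):
    print(number, repr(article))
```

Each returned string can then be written to its own file in the same way as in the loop above.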