Multiple output files are created but they are empty

Time: 2015-03-19 15:39:38

Tags: regex file python-2.7

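I am trying to split a file containing two articles into two separate files, one article per file, so that I can analyse the articles later. Each article in the initial file has an ID, and I want to use that ID to split the file with a regular expression.

Here is the initial input file, with the ID numbers: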

166068619   ####    "Epilepsy: let's end our ignorance of this neglected condition
Helen Stephens is a young woman with epilepsy [...]." 
106899978   ####    "Great British Payoff shows that BBC governance is broken
If it was a television series, they'd probably call it [...]."

However, when I run my code I do get two separate files as output, but they are empty.

Here is my code:

def file_split(path_to_file):
    """Function splits bigger file into N smaller ones, based on a certain RE
    match, that is used to break the bigger file into smaller ones"""
    def pattern_extract(path_to_file):   
        """Function identifies the number of RE occurences in a file, 
        No. can be used in further analysis as range No."""
        import re
        x = []
        with open(path_to_file) as f:
            for line in f:
                match = re.search(r'^\d+?\t####\t', line)
                if match:
                    a = match.group()
                    x.append(a)
        return len(x)

    y = pattern_extract(path_to_file)
    m = y + 1

    files = [open('filename%i.txt' %i, 'w') for i in range(1,m)]
    with open(path_to_file) as f:
        for line in f:
            match = re.search(r'^\d+?\t####\t', line)
            if match:
                a = match.group()
                #files = [open('filename%i.txt' %i, 'w') for i in range(1, m)]
                files[i-1].write(a)
    for f in files:
        f.close()
    return files

The output is as follows:

file_split(path)
Out[19]: 

[<open file 'filename1.txt', mode 'w' at 0x7fe121b130c0>,
 <open file 'filename2.txt', mode 'w' at 0x7fe121b131e0>]

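I am new to Python and not sure where the problem lies. I have looked at other answers about writing multiple output files but could not work out a solution. Any help is much appreciated.

1 answer:

Answer 0 (score: 0)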

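There are two problems with your code:

  • You only write the lines that match the ID (in fact, only the matched text itself), not the rest of the article
  • You always write to the last file, because you index with i, the loop variable "left over" from the list comprehension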
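
Both points can be checked in isolation. The sketch below uses a shortened version of the first sample line and assumes the ID, `####` and the text are separated by tabs, as the question's pattern expects; `sample_line` and `prefix` are names made up for illustration. It shows that `match.group()` returns only the matched prefix, and that a loop variable keeps its last value once the loop has finished. In Python 2.7 a list comprehension leaks its variable in the same way; in Python 3 it does not, so `files[i-1]` would typically raise a `NameError` there instead.

```python
import re

# Shortened first line of the sample input; the tab separators are an assumption.
sample_line = '166068619\t####\t"Epilepsy: let\'s end our ignorance [...]\n'

match = re.search(r'^\d+?\t####\t', sample_line)
prefix = match.group()          # only the text the pattern matched
print(repr(prefix))             # the ID prefix, not the article text
print(prefix == sample_line)    # False: the rest of the line is not included

# A plain for loop leaves its variable bound to the last value it took.
for i in range(1, 3):
    pass
print(i)                        # 2, so files[i-1] would always be the last file
```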

To fix it, you can change the bottom half of your code to:

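# Assumes `import re` at module level, not only inside pattern_extract
y = pattern_extract(path_to_file)
files = [open('filename%i.txt' % i, 'w') for i in range(y)]
n = -1  # index of the current article's file; no ID line seen yet
with open(path_to_file) as f:
    for line in f:
        # An ID line starts the next article, so move on to the next file
        if re.search(r'^\d+\s+####\s+', line):
            n += 1
        # Write every line, not just the matched ID prefix
        files[n].write(line)
for f in files:
    f.close()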

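But you do not need to read the file twice just to count the matches. Whenever a line matches the ID pattern, open another file; always write to the last file in the list; and close all the files at the end: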

open_files = []
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            open_files.append(open('filename%d.txt' % len(open_files), 'w'))
        open_files[-1].write(line)

for f in open_files:
    f.close()
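
As a variation on the same idea, here is a minimal sketch that keeps only one output file open at a time and skips any lines that appear before the first ID instead of failing on them. The function name `split_articles`, the `out_dir` parameter and the returned list of paths are assumptions for illustration, not part of the original code.

```python
import os
import re

# Same ID pattern as above: digits, whitespace, '####', whitespace.
ID_LINE = re.compile(r'^\d+\s+####\s+')


def split_articles(path_to_file, out_dir):
    """Write each article to its own file and return the paths created."""
    written = []
    out = None
    try:
        with open(path_to_file) as src:
            for line in src:
                if ID_LINE.search(line):
                    # A new article starts: close the previous file, open the next.
                    if out is not None:
                        out.close()
                    name = os.path.join(out_dir, 'filename%d.txt' % len(written))
                    written.append(name)
                    out = open(name, 'w')
                if out is not None:  # ignore lines before the first ID
                    out.write(line)
    finally:
        # Make sure the last file is closed even if reading fails part-way.
        if out is not None:
            out.close()
    return written
```

Calling `split_articles(path, '.')` would write `filename0.txt`, `filename1.txt`, and so on into the current directory, one per ID line found.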