Question

I have a CSV file in which a new topic starts after two blank rows. I want to split this file into two separate files. How can I do that?

................
................                
Biology I               
BGS Shivamogga I PUC    Exam Results            
Student Exam    # Questions Correct Answers Score %
ADARSHGOUDA M MUDIGOUDAR    Biology I - Chapter 1   35  23  65.70%
ADARSHGOUDA M MUDIGOUDAR    Biology I - Chapter 1   35  29  82.90%
ADARSHGOUDA M MUDIGOUDAR    Biology I - Chapter 1   35  32  91.40%
.
.
.
.

................
................                
Chemistry I             
BGS Shivamogga I PUC    Exam Results            
Student Exam    # Questions Correct Answers Score %
AISHWARYA P Chemistry I - Chapter 1 29  20  69.00%
MAHARUDRASWAMY M S  Chemistry I - Chapter 1 29  14  48.30%
NIKHIL B    Chemistry I - Chapter 1 29  20  69.00%

I tried splitting the DataFrame using dropna and skiprows, but I don't want to hard-code the number of rows. I want to split on the first two blank rows.

Answer 1

I would do something along these lines:

with open('input.txt','r') as input_file:
    data_str = input_file.read()
    data_array = data_str.split('\n\n')  # Split at every pair of consecutive newlines, i.e. at each empty line
    for i, smaller_data in enumerate(data_array):   
        with open(f'new_file_{i}.txt','w') as new_data_file:
            new_data_file.write(smaller_data)
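
Note that splitting on '\n\n' breaks the text at every single empty line. A minimal sketch of a stricter variant, assuming the sections should only be separated where two or more consecutive blank lines occur, and assuming blank lines contain at most spaces or tabs (not delimiter-only rows such as `,,,`). The helper name `split_on_double_blank` is hypothetical:

```python
import re

def split_on_double_blank(text):
    """Split text wherever two or more consecutive blank lines occur."""
    # One line break ending the last content line, followed by at least two
    # lines that hold nothing but optional spaces or tabs.
    parts = re.split(r'\n(?:[ \t]*\n){2,}', text)
    # Drop empty pieces, e.g. when the text itself starts with blank lines.
    return [part for part in parts if part.strip()]
```

Each returned piece can then be written to its own file in the same way as in the loop above.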

Answer 2

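I'd just use the csv module, passing rows from a csv.reader() object to a csv.writer() object while keeping a count of consecutive blank rows. Each time you find more than one blank row, replace the writer object with one for a new file.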

You can detect blank rows with the any() function, because a blank row will contain only empty strings or no values at all:

isblank = not any(row)
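
As a small illustration of that check, here is a sketch run against made-up in-memory data (the `sample` string is not from the question). It also shows one caveat: a cell holding only whitespace is a non-empty string, so such a row is not treated as blank unless the cells are stripped first.

```python
import csv
import io

# Made-up sample: a delimiter-only row, a truly empty line,
# and a row whose cells contain only a space.
sample = "a,b\n,,\n\n , \nc,d\n"
rows = list(csv.reader(io.StringIO(sample)))

# Plain check: empty strings and empty lists are falsy.
blank_flags = [not any(row) for row in rows]

# Stricter check that also treats whitespace-only cells as empty.
blank_flags_stripped = [not any(cell.strip() for cell in row) for row in rows]
```

Which of the two checks is appropriate depends on whether the input file pads its blank rows with spaces.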

Assuming that numbered files in the same directory are sufficient, this should work:

import csv
from pathlib import Path

def gen_outputfiles(outputdir, basefilename):
    """Generate open files ready for CSV writing, in outputdir using basefilename

    Numbers are inserted between the basefilename stem and suffix; e.g.
    foobar.csv becomes foobar001.csv, foobar002.csv, etc.

    """
    outputbase = Path(basefilename)
    outputstem, outputsuffix = outputbase.stem, outputbase.suffix
    counter = 0
    while True:
        counter += 1
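        # Build the numbered path first, then open it; calling .open() directly
        # on the f-string would fail because str has no open() method
        outputpath = outputdir / f'{outputstem}{counter:03d}{outputsuffix}'
        yield outputpath.open(mode='w', newline='')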

def split_csv_on_doubleblanks(inputfilename, basefilename=None, **kwargs):
    """Copy CSV rows from inputfilename to numbered files based on basefilename

    A new numbered target file is created after 2 or more blank rows have been
    read from the input CSV file.

    """
    inputpath = Path(inputfilename)
    outputfiles = gen_outputfiles(inputpath.parent, basefilename or inputpath.name)

    with inputpath.open(newline='') as inputfile:
        reader = csv.reader(inputfile, **kwargs)
        outputfile = next(outputfiles)
        writer = csv.writer(outputfile, **kwargs)
        blanks = 0
        try:
            for row in reader:
                isblank = not any(row)
                if not isblank and blanks > 1:
                    # skipped more than one blank row before finding a non-blank
                    # row. Open a new output file
                    outputfile.close()
                    outputfile = next(outputfiles)
                    writer = csv.writer(outputfile, **kwargs)
                blanks = blanks + 1 if isblank else 0
                writer.writerow(row)
        finally:
            if not outputfile.closed:
                outputfile.close()

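Note that I copy the blank rows across as well, so your files will end with multiple blank rows. This can be remedied by replacing the blanks counter with a list of blank rows, and writing that list out to the writer object whenever you are about to reset it and it contains just a single element. That way, single blank rows are preserved.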
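
A minimal sketch of that buffering idea, working on rows already held in memory instead of streaming between files. The function name `split_rows_on_doubleblanks` and the variable `pending_blanks` are hypothetical; it assumes rows as produced by csv.reader():

```python
def split_rows_on_doubleblanks(rows):
    """Group rows into sections, starting a new section after 2+ blank rows.

    A single blank row is kept inside its section; longer runs of blank
    rows only act as separators and are dropped.
    """
    sections = [[]]
    pending_blanks = []  # blank rows seen since the last non-blank row
    for row in rows:
        if not any(row):
            pending_blanks.append(row)
            continue
        if len(pending_blanks) == 1:
            # A lone blank row belongs to the current section.
            sections[-1].extend(pending_blanks)
        elif len(pending_blanks) > 1 and sections[-1]:
            # Two or more blank rows: start a new section.
            sections.append([])
        pending_blanks = []
        sections[-1].append(row)
    # Trailing blank rows are never flushed, so no section ends with blanks.
    return [section for section in sections if section]
```

Each resulting section could then be passed to csv.writer(...).writerows() on its own output file. For very large inputs, the same buffer can be used inside the streaming loop shown earlier instead of collecting everything in lists.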
