Python csv writer: automatically limit the number of rows per file and create a new file

Date: 2017-11-28 16:58:39

Tags: python csv

I am writing a script that writes a large amount of data to .csv files. To make transferring the data between interested users easier, I would like to impose a limit on the number of rows per file. For example, I would like the first million records written to some_csv_file_1.csv, the second million records written to some_csv_file_2.csv, and so on until all records have been written.

I have tried to get the following to work:

import csv
csv_record_counter = 1
csv_file_counter = 1

while csv_record_counter <= 1000000:
    with open('some_csv_file_' + str(csv_file_counter) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator = "\n")
        output_writer.writerow(['record'])
        csv_record_counter += 1
while not csv_record_counter <= 1000000:
    csv_record_counter = 1
    csv_file_counter += 1

Problem: as the record count exceeds 1000000, no subsequent files are created. The script keeps adding records to the original file.

4 Answers:

Answer 0 (score: 0)

First, indent the second while loop and remove the "not". Then use a for loop instead of a while loop to create the csvs. Also, don't forget to reset csv_record_counter.

import csv
csv_record_counter = 1

rows = ...  # Set this to your total number of rows to process

# One extra file is needed when the row count is not an exact multiple of 1000000
additional_file = 1 if rows % 1000000 != 0 else 0

for csv_file in range(1, rows // 1000000 + 1 + additional_file):  # one iteration per csv file to create
    with open('some_csv_file_' + str(csv_file) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator="\n")
        output_writer.writerow(['record'])  # header row
        csv_record_counter = 1  # Reset the counter for each file (remove your "+")
        while csv_record_counter <= 1000000:  # Remove your "not"
            csv_record_counter += 1
            output_writer.writerow(["your record"])

Edit: added additional_file
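As a quick, hypothetical sanity check of the file-count arithmetic (my addition, not part of the original answer): with 2,500,000 rows and a 1,000,000-row limit, three files are needed.

rows = 2500000  # hypothetical total: two full files plus one partial file
additional_file = 1 if rows % 1000000 != 0 else 0
print(rows // 1000000 + additional_file)  # -> 3

# Equivalent ceiling division:
import math
print(math.ceil(rows / 1000000))  # -> 3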

Answer 1 (score: 0)

I would batch the data before exporting it.

import csv

def batch(iterable, n=1):
    # Yield successive chunks of at most n items from the iterable
    length = len(iterable)
    for ndx in range(0, length, n):
        yield iterable[ndx:min(ndx + n, length)]

headers = []  # Your headers
products = []  # Millions of products go here
batch_size = int(len(products) / 4)  # Example
# OR, in your case, batch_size = 1000000

for idx, product_batch in enumerate(batch(products, batch_size)):
    with open('products_{}.csv'.format(idx + 1), 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        for product in product_batch:
            writer.writerow(product)
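For illustration (my addition, not part of the original answer), a quick check of how batch() chunks a list:

for chunk in batch([1, 2, 3, 4, 5, 6, 7], 3):
    print(chunk)
# [1, 2, 3]
# [4, 5, 6]
# [7]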

Answer 2 (score: 0)

I think your data transfer can succeed with the following class:

import csv

class Writer:
    def __init__(self, max_row):
        self.max_row = max_row
        self.cur_row = 0
        self.file_number = 0
        self.file_handle = None

    def write_row(self, row):
        # Start a new file when the row limit is reached or no file is open yet
        if self.cur_row >= self.max_row or self.file_handle is None:
            self.cur_row = 0
            self.file_number += 1

            if self.file_handle:
                self.file_handle.close()

            self.file_handle = open(f'some_csv_file_{self.file_number}.csv', 'w', newline='')
            self.csv_handle = csv.writer(self.file_handle)

        self.csv_handle.writerow(row)
        self.cur_row += 1


writer = Writer(10) # 1000000 for you

for row in range(55): # massive amount of data
    output_row = [row+1, "record1", "record2"]
    writer.write_row(output_row)

In the example, 10 records are currently written per file (some_csv_file_1.csv, some_csv_file_2.csv, ...).

For your case:

output_writer = Writer(1000000)
output_writer.write_row(['record'])
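One caveat worth noting (my addition, not part of the original answer): the Writer class above never explicitly closes its last file handle. A minimal close() method, assuming the class definition above, could look like this:

    def close(self):
        # Close the currently open file, if any, once all rows have been written
        if self.file_handle:
            self.file_handle.close()
            self.file_handle = None

Calling output_writer.close() after the final write_row() ensures the last file is flushed to disk.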

Answer 3 (score: -1)

Try using writefile.flush() after writer.writerow().

The flush call clears the buffer, freeing up RAM for new work.

When processing a large number of rows, the buffer fills up and is not cleared until the currently running code exits.

So it is better to clear the buffer manually each time you write something to the file with a write statement.
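A minimal sketch of where the flush() call would go, assuming the file object is named csvfile as in the question (note that flushing by itself does not split the output into multiple files):

import csv

with open('some_csv_file_1.csv', 'w', newline='') as csvfile:
    output_writer = csv.writer(csvfile, lineterminator="\n")
    for record in range(1000):  # stand-in for the real records
        output_writer.writerow([record])
        csvfile.flush()  # push buffered rows to disk immediately after each write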