I am writing a script that writes a large amount of data to .csv files. To make it easier to pass the data around between interested users, I want to impose a limit on the number of rows per file. For example, I want the first million records written to some_csv_file_1.csv, the second million records written to some_csv_file_2.csv, and so on, until all records have been written.
I have been trying to get the following to work:
import csv

csv_record_counter = 1
csv_file_counter = 1

while csv_record_counter <= 1000000:
    with open('some_csv_file_' + str(csv_file_counter) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator="\n")
        output_writer.writerow(['record'])
        csv_record_counter += 1

while not csv_record_counter <= 1000000:
    csv_record_counter = 1
    csv_file_counter += 1
Problem: once the record count passes 1,000,000, no subsequent files are created. The script keeps adding records to the original file.
Answer 0 (score: 0)
First, indent the second while loop and remove the "not". Then use a for loop instead of a while loop to create the CSVs. Also, don't forget to reset csv_record_counter.
import csv

rows = ...  # Your number of rows to process
additional_file = 1 if rows % 1000000 != 0 else 0  # one extra file for the leftover rows

for csv_file in range(1, int(rows / 1000000) + 1 + additional_file):  # number of CSVs to create
    with open('some_csv_file_' + str(csv_file) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator="\n")
        output_writer.writerow(['record'])
        csv_record_counter = 1  # reset here instead of "+="
        while csv_record_counter <= 1000000:  # no "not" here
            csv_record_counter += 1
            output_writer.writerow(["your record"])
Edit: added additional_file.
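As an aside (not part of the original answer), the file-count arithmetic can also be done with math.ceil, which avoids the separate additional_file flag and lets the last file hold only the leftover rows. A minimal sketch, assuming the total row count is known up front; the 2,500,000 figure is just an example:

import csv
import math

rows = 2500000          # example total; in practice, your real row count
rows_per_file = 1000000
num_files = math.ceil(rows / rows_per_file)  # 3 files for 2.5 million rows

for csv_file in range(1, num_files + 1):
    with open('some_csv_file_' + str(csv_file) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator="\n")
        output_writer.writerow(['record'])  # header row, as in the question
        remaining = rows - (csv_file - 1) * rows_per_file
        for _ in range(min(rows_per_file, remaining)):
            output_writer.writerow(["your record"])  # placeholder payload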
Answer 1 (score: 0)
I would batch the data before exporting it.
import csv

def batch(iterable, n=1):
    length = len(iterable)
    for ndx in range(0, length, n):
        yield iterable[ndx:min(ndx + n, length)]

headers = []   # Your headers
products = []  # Millions of products go here
batch_size = int(len(products) / 4)  # Example
# OR in your case, batch_size = 1000000

for idx, product_batch in enumerate(batch(products, batch_size)):
    with open('products_{}.csv'.format(idx + 1), 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        for product in product_batch:
            writer.writerow(product)
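One caveat: batch() above relies on len() and slicing, so it needs the whole list in memory and will not work on a plain iterator (for example, rows streamed from a database cursor). A sketch of an alternative built on itertools.islice, assuming only that the input is iterable; the names batch_iter and the placeholder products generator are illustrative:

import csv
from itertools import islice

def batch_iter(iterable, n):
    # Yield lists of up to n items from any iterable, without needing len().
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            break
        yield chunk

headers = []         # your fieldnames
products = iter([])  # e.g. a generator streaming rows as dicts
for idx, product_batch in enumerate(batch_iter(products, 1000000)):
    with open('products_{}.csv'.format(idx + 1), 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        writer.writerows(product_batch)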
Answer 2 (score: 0)
I think your data transfer can be handled successfully with the following class:
import csv

class Writer:
    def __init__(self, max_row):
        self.max_row = max_row
        self.cur_row = 0
        self.file_number = 0
        self.file_handle = None

    def write_row(self, row):
        if self.cur_row >= self.max_row or self.file_handle is None:
            self.cur_row = 0
            self.file_number += 1
            if self.file_handle:
                self.file_handle.close()
            self.file_handle = open(f'some_csv_file_{self.file_number}.csv', 'w', newline='')
            self.csv_handle = csv.writer(self.file_handle)
        self.csv_handle.writerow(row)
        self.cur_row += 1

writer = Writer(10)  # 1000000 for you
for row in range(55):  # massive amount of data
    output_row = [row + 1, "record1", "record2"]
    writer.write_row(output_row)
In this example, 10 records are written to each file (some_csv_file_1.csv, some_csv_file_2.csv, ...).
For your case:

output_writer = Writer(1000000)
output_writer.write_row(['record'])
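One detail the class leaves open is that the last file handle is never closed explicitly. A possible extension (my sketch, not from the original answer; the name ClosingWriter is made up) adds a close() method and context-manager support on top of the Writer class above:

class ClosingWriter(Writer):
    def close(self):
        # Close whatever file is currently open, if any.
        if self.file_handle:
            self.file_handle.close()
            self.file_handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

# Usage: the final file is closed automatically when the block exits.
with ClosingWriter(1000000) as output_writer:
    output_writer.write_row(['record'])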
Answer 3 (score: -1)
Try using writefile.flush() after writer.writerow().
The flush call clears the write buffer, freeing memory for new work. When processing a large number of rows, the buffer fills up with pending writes and is not cleared until the currently running code finishes. So it is better to flush the buffer manually each time you write something to the file.
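For illustration, a hedged sketch of what periodic flushing could look like with the question's setup (the file name and interval are just examples). Note that flushing alone does not split the output into multiple files, so it addresses memory pressure rather than the one-file-per-million-rows requirement:

import csv

with open('some_csv_file_1.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, lineterminator="\n")
    for i in range(1, 1000001):
        writer.writerow([i, 'record'])
        if i % 100000 == 0:   # arbitrary interval chosen for illustration
            csvfile.flush()   # push Python's write buffer out to the OS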