I have a 10k-row CSV that I want to write to S3 in 1k-row chunks.
from io import StringIO
import boto3
import pandas as pd
csv_buffer = StringIO()
df.to_csv(csv_buffer, chunksize=1000)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'df.csv').put(Body=csv_buffer.getvalue())
This gives me the first 1k rows in the string buffer to write to S3, but csv_buffer doesn't look like an iterator I can loop over.
Does anyone know how to achieve this?
Answer 0 (score: 2)
It looks like StringIO doesn't really pay attention to chunksize. (.readlines() always just returns individual lines, not blocks of lines.)
I'm not very familiar with boto3, but itertools.islice may be what's needed here: it slices an iterator without creating an intermediate data structure.
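To make the islice idea concrete, here is a minimal sketch using only the standard library. The buffer contents and the chunk size of 4 are made-up illustration values, not taken from the question. Each call to islice pulls up to that many lines from the buffer and leaves it positioned at the next unread line, so repeated calls walk through the file in blocks.

```python
from io import StringIO
from itertools import islice

# Stand-in for a CSV buffer: one header line plus ten data rows (illustrative).
buffer = StringIO("col\n" + "".join(f"{i}\n" for i in range(10)))

chunksize = 4  # lines per chunk (illustrative value)
chunks = []
while True:
    # islice consumes at most `chunksize` lines; the buffer remembers
    # its position, so the next call continues where this one stopped.
    chunk = list(islice(buffer, chunksize))
    if not chunk:
        break
    chunks.append(chunk)

print([len(c) for c in chunks])  # [4, 4, 3]
```

The last chunk is simply shorter when the line count is not a multiple of the chunk size.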
If this looks like it might fit your needs, I can add some explanation alongside the code:
>>> from io import StringIO
... from itertools import islice
... import sys
...
... import numpy as np
... import pandas as pd
...
... df = pd.DataFrame(np.arange(300).reshape(100, -1))
... csv_buffer = StringIO()
... df.to_csv(csv_buffer)
... csv_buffer.seek(0)
...
... # Account for indivisibility (scoop up a remainder on the final slice).
... chunksize = 33
... n_lines = len(df) + 1  # data rows plus the header line (not df.shape[1], the column count)
... slices = [(0, chunksize)] * (n_lines // chunksize - 1) + [(0, sys.maxsize)]
... chunks = (tuple(islice(csv_buffer, i, j)) for i, j in slices)
...
>>> next(chunks)
(',0,1,2\n',
'0,0,1,2\n',
'1,3,4,5\n',
'2,6,7,8\n',
'3,9,10,11\n',
'4,12,13,14\n',
'5,15,16,17\n',
'6,18,19,20\n',
'7,21,22,23\n',
'8,24,25,26\n',
'9,27,28,29\n',
'10,30,31,32\n',
'11,33,34,35\n',
'12,36,37,38\n',
'13,39,40,41\n',
'14,42,43,44\n',
'15,45,46,47\n',
'16,48,49,50\n',
'17,51,52,53\n',
'18,54,55,56\n',
'19,57,58,59\n',
'20,60,61,62\n',
'21,63,64,65\n',
'22,66,67,68\n',
'23,69,70,71\n',
'24,72,73,74\n',
'25,75,76,77\n',
'26,78,79,80\n',
'27,81,82,83\n',
'28,84,85,86\n',
'29,87,88,89\n',
'30,90,91,92\n',
'31,93,94,95\n')
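Tying this back to the original question, the following is one possible sketch of handing each chunk to S3 as its own object. It rests on several assumptions: the helper names iter_csv_chunks and upload_chunks, the key pattern df_part_0000.csv, and the choice to repeat the header in every chunk are all hypothetical and not from the question or the answer. Splitting by line also assumes no CSV field contains an embedded newline. The upload step is passed in as a callable so the logic can be tried without AWS access; the demo uses a plain dict in place of a bucket.

```python
from io import StringIO
from itertools import islice


def iter_csv_chunks(csv_buffer, rows_per_chunk):
    """Yield CSV text chunks, each one starting with the header line."""
    header = csv_buffer.readline()
    while True:
        rows = list(islice(csv_buffer, rows_per_chunk))
        if not rows:
            return
        yield header + "".join(rows)


def upload_chunks(csv_buffer, rows_per_chunk, put):
    """Pass each chunk to put(key, body) and return the keys used."""
    keys = []
    for n, body in enumerate(iter_csv_chunks(csv_buffer, rows_per_chunk)):
        key = f"df_part_{n:04d}.csv"  # hypothetical key naming scheme
        put(key, body)
        keys.append(key)
    return keys


# Demo: an in-memory dict stands in for the bucket (illustrative data).
fake_bucket = {}
demo = StringIO(",a\n" + "".join(f"{i},{i * 2}\n" for i in range(5)))
keys = upload_chunks(demo, 2, fake_bucket.__setitem__)

# With boto3, as in the question, `put` could be written as:
#   s3_resource = boto3.resource('s3')
#   put = lambda key, body: s3_resource.Object(bucket, key).put(Body=body)
```

For the real DataFrame, the buffer would be the one filled by df.to_csv(csv_buffer) and rewound with csv_buffer.seek(0), with rows_per_chunk set to 1000.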