Split a DataFrame by size, Python 3.6

Time: 2018-11-14 12:40:27

Tags: python python-3.x pandas dataframe split

I want to split a DataFrame into separate datasets of 30 MB each, and then export each one to a CSV file.

# In-memory size in bytes, converted to MB (1 MB = 1024 * 1024 = 1048576 bytes)
FileSize = SQLData.memory_usage(index=True, deep=False).sum()
FileSizeMB = FileSize / 1048576
if FileSizeMB > 30:
    # Want to split the DataFrame into parts below 30 MB.
    # Export the split DataFrames.
    pass
else:
    SQLData.to_csv(r'D:\Export\SQLData.csv', sep=',', index=False, na_rep='NA')

Is this possible?

1 answer:

Answer 0 (score: 0)

Try the following recursive solution:

import numpy as np
import pandas as pd


# solution
def save_file_part(df, size_threshold, save_path, part_number=0):
    # In-memory size in MB (1 MB = 1024 * 1024 = 1048576 bytes)
    file_size = df.memory_usage(index=True, deep=False).sum() / 1048576
    num_records = len(df)

    if file_size > size_threshold and num_records > 1:
        # Number of leading rows that fit under the threshold; at least one,
        # so that every recursive call makes progress
        records_to_split_off = max(1, int(num_records * size_threshold // file_size))
        df_to_save = df.head(records_to_split_off)
        df_to_save.to_csv(save_path.format(part_number), sep=',', index=False, na_rep='NA')
        # Recurse on the remaining rows; each call writes one file
        save_file_part(df.tail(num_records - records_to_split_off), size_threshold, save_path, part_number=part_number + 1)

    else:
        df.to_csv(save_path.format(part_number), sep=',', index=False, na_rep='NA')


# example
dates = pd.date_range('20130101', periods=60000)
df = pd.DataFrame(np.random.randn(60000, 4), index=dates, columns=list('ABCD'))
file_size = df.memory_usage(index=True, deep=False).sum() / 1048576
print(file_size)

# note: the function expects "save_path" as a string with at least one "{}" placeholder
save_file_part(df, 0.5, save_path="c:/tmp/my_df_{}.csv")

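df.memory_usage tells you the size of the pandas DataFrame in memory. The size will differ once it is saved as CSV (here it will be larger), so you may want to set size_threshold to something like 15 MB. You could work out the appropriate size with a script, but you can also just experiment to find the right ratio.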
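
The ratio between in-memory size and CSV size can also be estimated instead of guessed. Below is a minimal sketch of one way to do that, not part of the original answer: it renders a sample of rows to CSV text, measures the bytes per row, and derives how many rows should fit in a file of a given size. The helper names csv_bytes, rows_per_part and iter_parts are hypothetical, and the sketch assumes the first rows are representative of the rest; if row widths vary a lot, individual files may still overshoot the target.

```python
import numpy as np
import pandas as pd


def csv_bytes(df):
    """Size in bytes of df rendered as UTF-8 CSV text (nothing is written to disk)."""
    text = df.to_csv(sep=',', index=False, na_rep='NA')
    return len(text.encode('utf-8'))


def rows_per_part(df, target_mb, sample_rows=1000):
    """Estimate how many rows fit in a CSV file of roughly target_mb megabytes."""
    sample = df.head(sample_rows)
    if len(sample) == 0:
        return 1
    header_bytes = csv_bytes(sample.head(0))  # header line only
    bytes_per_row = (csv_bytes(sample) - header_bytes) / len(sample)
    budget = target_mb * 1024 * 1024 - header_bytes
    return max(1, int(budget // bytes_per_row))


def iter_parts(df, target_mb):
    """Yield consecutive row slices of df, each estimated to fit in target_mb as CSV."""
    step = rows_per_part(df, target_mb)
    for start in range(0, len(df), step):
        yield df.iloc[start:start + step]


np.random.seed(0)
df = pd.DataFrame(np.random.randn(60000, 4), columns=list('ABCD'))

# Ratio of CSV size to in-memory size for this particular DataFrame
ratio = csv_bytes(df) / df.memory_usage(index=True, deep=False).sum()
print("csv / memory ratio:", ratio)

parts = list(iter_parts(df, target_mb=0.5))
for number, part in enumerate(parts):
    # To export, replace the print with something like:
    # part.to_csv("my_df_{}.csv".format(number), sep=',', index=False, na_rep='NA')
    print(number, len(part), csv_bytes(part))
```

Because this loops over slices instead of recursing, the number of parts is not limited by Python's recursion depth. The threshold here refers to the size of the CSV text itself, so no separate safety margin such as 15 MB versus 30 MB is needed beyond the sampling error.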