Summary of the problem
Short version
How do I go from a Dask bag of pandas DataFrames to a single Dask DataFrame?
Long version
I have a number of files that none of dask.dataframe's various read functions (e.g. dd.read_csv
or dd.read_parquet
) can read. I do have my own function that reads them into pandas DataFrames (the function works on only one file at a time, analogous to pd.read_csv
). I would like to put all of these individual pandas DataFrames into one large Dask DataFrame.
Minimal working example
Here is some sample CSV data (my data is not actually in CSV format, but it is used here for the sake of the example). To create a minimal working example, you can save it as a CSV, make a few copies, and then use the code below:
"gender","race/ethnicity","parental level of education","lunch","test preparation course","math score","reading score","writing score"
"female","group B","bachelor's degree","standard","none","72","72","74"
"female","group C","some college","standard","completed","69","90","88"
"female","group B","master's degree","standard","none","90","95","93"
"male","group A","associate's degree","free/reduced","none","47","57","44"
"male","group C","some college","standard","none","76","78","75"
from glob import glob
import pandas as pd
import dask.bag as db
files = glob('/path/to/your/csvs/*.csv')
bag = db.from_sequence(files).map(pd.read_csv)
Things I have tried so far
import pandas as pd
import dask.bag as db
import dask.dataframe as dd
# Create a Dask bag of pandas dataframes
bag = db.from_sequence(list_of_files).map(my_reader_function)
df = bag.map(lambda x: x.to_records()).to_dataframe()  # this doesn't work
df = bag.map(lambda x: x.to_dict(orient=<any option>)).to_dataframe()  # neither does this

# This gets me really close. It's a bag of Dask DataFrames,
# but I can't figure out how to concatenate them together.
df = bag.map(dd.from_pandas, npartitions=1)

df = dd.from_delayed(bag)  # returns an error
Answer 0 (score: 1)
I recommend using dask.delayed together with dask.dataframe. There is a good example of doing what you want to do here:
Answer 1 (score: 0)
If you already have a bag of dataframes, you can do something like this:
In Python:
import dask
import dask.dataframe
import pandas

def bag_to_dataframe(bag, **concat_kwargs):
    # Each bag partition is a list of pandas DataFrames; concatenate
    # each list into one DataFrame, lazily, then stitch the results
    # together into a single Dask DataFrame.
    partitions = bag.to_delayed()
    dataframes = map(
        dask.delayed(lambda partition: pandas.concat(partition, **concat_kwargs)),
        partitions,
    )
    return dask.dataframe.from_delayed(dataframes)
You may want to control how the partitions are concatenated, for example by ignoring the index.