I'm trying to use the groupby-apply syntax in Dask, using Pandas code as a starting template. First, some synthetic data:
import pandas as pd

pandasdf = pd.DataFrame(data=[(1990, 1, 'OK', 93, 153), (1990, 1, 'OK', 22, 12), (1990, 1, 'NOK', 7, 21),
                              (1990, 2, 'OK', 91, 185), (1990, 2, 'NOK', 22, 17)],
                        columns=['Year', 'Month', 'Status', 'Count', 'Amount'])
Then the code that works in Pandas:
def makepandasgroupedpercent(groupdf: pd.DataFrame) -> pd.DataFrame:
    """
    return df with percent of count and amount per grouped df
    positional args:
    groupdf: grouped df
    """
    # totals for the current (Year, Month) group
    countsum = groupdf['Count'].sum()
    amountsum = groupdf['Amount'].sum()
    groupdf['Count%'] = round(groupdf['Count']*100 / countsum, 2)
    groupdf['Amount%'] = round(groupdf['Amount']*100 / amountsum, 2)
    return groupdf

pandaspercentgroupeddf = (pandasdf[['Year', 'Month', 'Status', 'Count', 'Amount']]
                          .groupby(['Year', 'Month', 'Status'])
                          .agg(sum)
                          .groupby(['Year', 'Month'])
                          .apply(makepandasgroupedpercent))
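For comparison, the same percentages can probably be obtained in Pandas without modifying each group inside apply. This is only a sketch under my own assumptions: it uses `transform('sum')` to broadcast the per-(Year, Month) totals back onto the rows, and the names `summed`, `totals` and `transformdf` are introduced here purely for illustration.

```python
import pandas as pd

pandasdf = pd.DataFrame(
    data=[(1990, 1, 'OK', 93, 153), (1990, 1, 'OK', 22, 12), (1990, 1, 'NOK', 7, 21),
          (1990, 2, 'OK', 91, 185), (1990, 2, 'NOK', 22, 17)],
    columns=['Year', 'Month', 'Status', 'Count', 'Amount'])

# Sum Count and Amount per (Year, Month, Status).
summed = pandasdf.groupby(['Year', 'Month', 'Status'])[['Count', 'Amount']].sum()

# Broadcast each (Year, Month) total back onto its rows. Year and Month are
# levels of the MultiIndex at this point, so they are grouped by level name.
totals = summed.groupby(level=['Year', 'Month']).transform('sum')

# Build the result on a copy so the intermediate frame is left untouched.
transformdf = summed.copy()
transformdf['Count%'] = (summed['Count'] * 100 / totals['Count']).round(2)
transformdf['Amount%'] = (summed['Amount'] * 100 / totals['Amount']).round(2)
print(transformdf)
```

Within each (Year, Month) group the two percentage columns should each add up to roughly 100, apart from rounding.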
Now on to Dask (v 1.1.1):
import dask.dataframe as dd

daskdf = dd.from_pandas(pandasdf, chunksize=100)
daskdf.head()  # OK

def makedaskgroupedpercent(groupdf: dd) -> dd:
    """
    return df with percent of count and amount per grouped df
    positional args:
    groupdf: grouped df
    """
    countsum = groupdf['Count'].sum().compute()  # same error w/o compute()
    # https://stackoverflow.com/questions/52663751/dask-dataframe-sum-of-column-always-returning-scalar
    amountsum = groupdf['Amount'].sum().compute()
    groupdf['Count%'] = groupdf['Count']*100 / countsum
    groupdf['Amount%'] = groupdf['Amount']*100 / amountsum
    return groupdf

daskpercentgroupeddf = (daskdf[['Year', 'Month', 'Status', 'Count', 'Amount']]
                        .groupby(['Year', 'Month', 'Status'])
                        .agg(sum)
                        .groupby(['Year', 'Month'])
                        .apply(makedaskgroupedpercent,
                               meta={'Count': 'int', 'Amount': 'f8', 'Count%': 'f8', 'Amount%': 'f8'}).compute())
Error message:
Length of passed values is 0, index implies 4
There is also a warning message:
FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead
return pd.MultiIndex(levels=levels, labels=labels, names=idx.names)
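One direction that might sidestep apply altogether is to compute the group totals with a second groupby-sum and merge them back. The sketch below is deliberately limited to operations that, as far as I know, exist on both Pandas and Dask DataFrames (groupby-sum, `reset_index`, `merge`, column arithmetic, `drop`). It is only exercised here on a Pandas frame, so whether it runs unchanged on `daskdf` is an assumption rather than something verified. The function name `add_group_percent` and the `_total` suffix are made up for this illustration.

```python
import pandas as pd

def add_group_percent(df):
    """Sum per (Year, Month, Status), then add Count% and Amount% per (Year, Month).

    Hypothetical helper: avoids groupby-apply by merging group totals back in.
    """
    keys = ['Year', 'Month']
    # One row per (Year, Month, Status) with summed Count and Amount.
    summed = df.groupby(keys + ['Status'])[['Count', 'Amount']].sum().reset_index()
    # One row per (Year, Month) holding the totals used as denominators.
    totals = summed.groupby(keys)[['Count', 'Amount']].sum().reset_index()
    # Attach the totals to every row; total columns get the '_total' suffix.
    merged = summed.merge(totals, on=keys, suffixes=('', '_total'))
    merged['Count%'] = (merged['Count'] * 100 / merged['Count_total']).round(2)
    merged['Amount%'] = (merged['Amount'] * 100 / merged['Amount_total']).round(2)
    return merged.drop(columns=['Count_total', 'Amount_total'])

sample = pd.DataFrame(
    data=[(1990, 1, 'OK', 93, 153), (1990, 1, 'OK', 22, 12), (1990, 1, 'NOK', 7, 21),
          (1990, 2, 'OK', 91, 185), (1990, 2, 'NOK', 22, 17)],
    columns=['Year', 'Month', 'Status', 'Count', 'Amount'])

mergedpercentdf = add_group_percent(sample)
print(mergedpercentdf)
# Untested assumption: with Dask this would be add_group_percent(daskdf).compute()
```

Because no user function is passed to apply, this variant needs no `meta` argument, which may or may not be related to the error above.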
Thanks.