Groupby-apply: Pandas OK, Dask not

Date: 2019-05-25 16:55:31

Tags: python pandas dask

I am trying to use the groupby-apply syntax in Dask, using Pandas code as a starting template. First, some synthetic data:

import pandas as pd

pandasdf = pd.DataFrame(data=[(1990, 1, 'OK', 93, 153), (1990, 1, 'OK', 22, 12), (1990, 1, 'NOK', 7, 21),
                              (1990, 2, 'OK', 91, 185), (1990, 2, 'NOK', 22, 17)],
                        columns=['Year', 'Month', 'Status', 'Count', 'Amount'])

Then the code that works with Pandas:

def makepandasgroupedpercent(groupdf: pd.DataFrame)->pd.DataFrame:
    """
        return df with percent of count and amount per grouped df
        positional args:
        groupdf: grouped df
    """
    countsum = groupdf['Count'].sum()
    amountsum = groupdf['Amount'].sum()
    groupdf['Count%'] = round(groupdf['Count']*100 / countsum, 2)
    groupdf['Amount%'] = round(groupdf['Amount']*100 / amountsum, 2)

    return groupdf

pandaspercentgroupeddf = (pandasdf[['Year', 'Month', 'Status', 'Count', 'Amount']]
                          .groupby(['Year', 'Month', 'Status'])
                          .agg(sum)
                          .groupby(['Year', 'Month'])
                          .apply(makepandasgroupedpercent))
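
For reference, the same percentages can also be computed in Pandas without `apply`, which may be an easier starting point for a Dask port. The following is a minimal sketch, assuming the synthetic data above; the names `totals`, `month_totals` and `percent` are illustrative and not part of the original code.

```python
import pandas as pd

pandasdf = pd.DataFrame(
    data=[(1990, 1, 'OK', 93, 153), (1990, 1, 'OK', 22, 12), (1990, 1, 'NOK', 7, 21),
          (1990, 2, 'OK', 91, 185), (1990, 2, 'NOK', 22, 17)],
    columns=['Year', 'Month', 'Status', 'Count', 'Amount'])

# Sum Count and Amount per (Year, Month, Status).
totals = pandasdf.groupby(['Year', 'Month', 'Status'])[['Count', 'Amount']].sum()

# transform('sum') broadcasts each (Year, Month) total back onto every row of that group,
# so the result has the same index as `totals`.
month_totals = totals.groupby(level=['Year', 'Month']).transform('sum')

# Share of each Status within its (Year, Month), in percent.
percent = totals.copy()
percent['Count%'] = (totals['Count'] * 100 / month_totals['Count']).round(2)
percent['Amount%'] = (totals['Amount'] * 100 / month_totals['Amount']).round(2)

print(percent)
```

Because the group function no longer mutates the frame it receives, this variant does not depend on how `apply` handles in-place changes.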

Now for Dask (v 1.1.1):

import dask.dataframe as dd

daskdf = dd.from_pandas(pandasdf, chunksize=100)
daskdf.head() #OK

def makedaskgroupedpercent(groupdf: dd)-> dd:
    """
        return df with percent of count and amount per grouped df
        positional args:
        groupdf: grouped df
    """
    countsum = groupdf['Count'].sum().compute()#same error w/o compute()
    #https://stackoverflow.com/questions/52663751/dask-dataframe-sum-of-column-always-returning-scalar
    amountsum = groupdf['Amount'].sum().compute()
    groupdf['Count%'] = groupdf['Count']*100 / countsum
    groupdf['Amount%'] = groupdf['Amount']*100 / amountsum

    return groupdf

daskpercentgroupeddf = (daskdf[['Year', 'Month', 'Status', 'Count', 'Amount']]
                        .groupby(['Year', 'Month', 'Status'])
                        .agg(sum)
                        .groupby(['Year', 'Month'])
                        .apply(makedaskgroupedpercent, 
                            meta={'Count': 'int', 'Amount': 'f8', 'Count%': 'f8', 'Amount%': 'f8'}).compute())

Error message:

Length of passed values is 0, index implies 4

There is also a warning message:

FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead
  return pd.MultiIndex(levels=levels, labels=labels, names=idx.names)

Thanks

0 answers:

No answers