Dask DataFrame: defining meta for a date diff in groupby

Time: 2019-03-26 17:56:08

Tags: pandas datetime pandas-groupby dask

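I'm trying to find inter-purchase times (i.e., the number of days between orders) for each customer. My code runs correctly without meta defined, but I would like to define it properly so that I no longer see the warning asking me to provide meta.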

I would also appreciate any suggestions on using map or map_partitions instead of apply.

So far I've tried:

  • meta={'days_since_last_order': 'datetime64[ns]'}

  • meta={'days_since_last_order': 'f8'}

  • meta={'ORDER_DATE_DT':'datetime64[ns]','days_since_last_order': 'datetime64[ns]'}

  • meta={'ORDER_DATE_DT':'f8','days_since_last_order': 'f8'}

  • meta=('days_since_last_order', 'f8')

  • meta=('days_since_last_order', 'datetime64[ns]')
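
None of the attempts above declares a timedelta dtype, yet subtracting one datetime from another in pandas yields a timedelta, not a datetime or a float. The following is a minimal pandas-only sketch of that check, not a verified fix for the dask warning. The toy data and the names `pdf`, `gaps`, `days`, and `meta` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the question's data (values are illustrative only).
pdf = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 2, 2, 2],
    "ORDER_DATE_DT": pd.to_datetime(
        ["2015-01-01", "2015-01-11", "2016-03-01", "2016-03-04", "2016-04-03"]
    ).astype("datetime64[ns]"),
})

# Difference between consecutive order dates within each customer.
# The first order of each customer has no predecessor, so it becomes NaT.
gaps = pdf.groupby("CUSTOMER_ID")["ORDER_DATE_DT"].diff(1)
print(gaps.dtype)  # a timedelta dtype, not datetime64 or float

# If a plain number of days is wanted instead, divide by one day;
# the result is float64 with NaN where the gap is undefined.
days = gaps / pd.Timedelta(days=1)
print(days.dtype)

# An empty frame describing the columns and dtypes the applied function
# returns. Assumption: something of this shape is what meta should describe
# for the apply call in the question.
meta = pd.DataFrame({
    "ORDER_DATE_DT": pd.Series(dtype="datetime64[ns]"),
    "days_since_last_order": pd.Series(dtype="timedelta64[ns]"),
})
print(meta.dtypes)
```

If this reasoning holds, a dict such as `{'ORDER_DATE_DT': 'datetime64[ns]', 'days_since_last_order': 'timedelta64[ns]'}` may be a closer match than the variants listed above, though that has not been tested against dask here.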

Here is my code:

import numpy as np
import pandas as pd
import datetime as dt
import dask.dataframe as dd
from dask.distributed import wait, Client

client = Client(processes=True)

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
d = (end - start).days + 1

np.random.seed(0)
df = pd.DataFrame()
df['CUSTOMER_ID'] = np.random.randint(1, 4, 10)
df['ORDER_DATE_DT'] = start + pd.to_timedelta(np.random.randint(1, d, 10), unit='d')
print(df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT']))
print(df)

ddf = dd.from_pandas(df, npartitions=2)

# setting ORDER_DATE_DT as index to sort by date
ddf = ddf.set_index('ORDER_DATE_DT')
ddf = client.persist(ddf)
wait(ddf)

ddf = ddf.reset_index()
grp = ddf.groupby('CUSTOMER_ID')[['ORDER_DATE_DT']].apply(
    lambda df: df.assign(days_since_last_order=df.ORDER_DATE_DT.diff(1))
    # meta=????
)

# for some reason, I'm unable to print grp unless I reset_index()
grp = grp.reset_index()
print(grp.compute())

Here is the printout of df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT']):

[image: printout of the sorted DataFrame]

Here is the printout of grp.compute():

[image: printout of grp.compute()]

0 answers:

No answers