Question

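I am trying to get the difference, in months, between the earliest and latest sale date of each product as a new column. But when I apply my function inside a groupby, I get an unexpected result.

Any help is much appreciated.

So my steps are:

Data: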

    pch_date      day product  qty  unit_price  total_price  year_month  
421 2013-01-07  tuesday      p3   13        4.58        59.54           1   
141 2015-09-13   monday      p8    3        3.77        11.31           9   
249 2015-02-02   monday      p5    3        1.80         5.40           2   
826 2015-10-09  tuesday      p5    6        1.80        10.80          10   
427 2014-04-18   friday      p7    6        4.21        25.26           4

Function definition:

    def diff_date(x):
       max_date = x.max()
       min_date = x.min()
       diff_month = (max_date.year - min_date.year)*12 + max_date.month +1
       return diff_month

When I test it on the whole column:

    print diff_date(prod_df['pch_date'])

This prints 49, which is correct.

But here is the problem:

    print prod_df[['product','pch_date']].groupby(['product']).agg({'pch_date': diff_date}).reset_index()[:5]

The result comes back as a date instead of an integer:

      product                      pch_date
    0      p1 1970-01-01 00:00:00.000000049
    1     p10 1970-01-01 00:00:00.000000048
    2     p11 1970-01-01 00:00:00.000000045
    3     p12 1970-01-01 00:00:00.000000049
    4     p13 1970-01-01 00:00:00.000000045

How can I get the difference as an integer?

Answer 1

You can use Groupby.apply instead, which returns integers rather than datetime objects:

    df.groupby(['product'])['pch_date'].apply(diff_date).reset_index()
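To make the apply approach easy to try, here is a minimal self-contained sketch. It assumes Python 3 and a recent pandas, and the four rows are hypothetical sample data loosely based on the question, not the asker's full data set. The function keeps the question's formula unchanged.

```python
import pandas as pd


def diff_date(x):
    # Same formula as in the question, left unchanged.
    max_date = x.max()
    min_date = x.min()
    return (max_date.year - min_date.year) * 12 + max_date.month + 1


# Hypothetical sample data (two products, two purchases each).
df = pd.DataFrame({
    "product": ["p5", "p5", "p3", "p3"],
    "pch_date": pd.to_datetime(
        ["2015-02-02", "2015-10-09", "2013-01-07", "2014-04-18"]
    ),
})

# apply() keeps whatever the function returns, so the column holds integers.
result = df.groupby("product")["pch_date"].apply(diff_date).reset_index()
print(result)
print(result["pch_date"].dtype)
```

The `pch_date` column in `result` keeps its original name, so you may want to rename it (for example to `diff_month`, a name chosen here only for illustration) before merging it back.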

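As a workaround that stops the integer values from being converted back to datetime values, you can change the last line of the function to return str(diff_month). Then you can keep using Groupby.agg, as shown:

    df.groupby(['product'])['pch_date'].agg({'pch_date': diff_date}).reset_index()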
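The odd dates in the question appear to be the integer results read back as nanoseconds since 1970-01-01, which is how pandas interprets a bare integer as a timestamp. The sketch below shows that, then the string workaround followed by a cast back to integers. It assumes Python 3 and a recent pandas, where, as far as I know, the dict form on a single selected column is no longer accepted, so the function is passed to `agg` directly. The sample rows and the name `diff_date_str` are hypothetical.

```python
import pandas as pd

# An integer turned into a timestamp is counted in nanoseconds from the epoch.
print(pd.Timestamp(49))  # 49 ns after 1970-01-01


def diff_date_str(x):
    # Question's formula, but returned as a string so it cannot be
    # reinterpreted as a datetime value.
    max_date = x.max()
    min_date = x.min()
    diff_month = (max_date.year - min_date.year) * 12 + max_date.month + 1
    return str(diff_month)


# Hypothetical sample data.
df = pd.DataFrame({
    "product": ["p5", "p5", "p3", "p3"],
    "pch_date": pd.to_datetime(
        ["2015-02-02", "2015-10-09", "2013-01-07", "2014-04-18"]
    ),
})

result = df.groupby("product")["pch_date"].agg(diff_date_str).reset_index()
# Convert the strings back to integers once the aggregation is done.
result["pch_date"] = result["pch_date"].astype(int)
print(result)
```

If your pandas version already returns integers from `agg` for this function, the string round trip is unnecessary and `apply` or a plain `agg(diff_date)` is simpler.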

groupby datediff in pandas
