Group by and find the difference

Time: 2019-04-23 19:37:58

Tags: python-3.x pandas pandas-groupby

I have a pandas DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1,10,size=(6,2)),columns = list("AB"))
df["A"] = ["1111","2222","1111","1111","2222","1111"]
df["B"] = ["2001-01-10","2001-01-02","2001-02-11","2001-03-14","2001-02-01","2001-04-14"]
df

Output:

     A         B
0   1111    2001-01-10
1   2222    2001-01-02
2   1111    2001-02-11
3   1111    2001-03-14
4   2222    2001-02-01
5   1111    2001-04-14

I am trying to create a new column:

max(difference in (month,day) of transaction for every user)

For example, for user "1111", the distinct (month, day) transactions are:

[('01','10'),('02','11'),('03','14'),('04','14')]

and the differences are

[1,3,0] => max(diff) = 3

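This is because the first transaction is on January 10 and the next is on February 11 (11 - 10 => 1), followed by transactions on March 14 and April 14, which give (14 - 11 => 3) and (14 - 14 => 0).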
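The arithmetic above can be sketched in plain Python. This only illustrates the rule being asked for (compare the day-of-month of consecutive transactions and ignore the month); the names `days`, `diffs` and `max_diff` are made up for the illustration.

```python
# Day-of-month values for user "1111", in transaction order
# (taken from the rows 2001-01-10, 2001-02-11, 2001-03-14, 2001-04-14).
days = [10, 11, 14, 14]

# Difference between each day value and the one before it.
diffs = [later - earlier for earlier, later in zip(days, days[1:])]

max_diff = max(diffs)
print(diffs, max_diff)  # [1, 3, 0] 3
```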

Expected output:

 A    Max_diff
1111   3

Code:

df.groupby("A",as_index=False).apply(lambda x: list(map(lambda d: (d.split("-")[1],d.split("-")[2]),x["B"])))

Output:

0    [(01, 10), (02, 11), (03, 14), (04, 14)]
1                        [(01, 02), (02, 01)]
dtype: object

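I am iterating over these lists to find the maximum value. On a huge dataset this takes a lot of time. Is there another way to get the expected output?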

2 answers:

Answer 0: (score: 1)

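This is what you need. Note that B has to be a datetime column first, because the .dt accessor is not available on strings:

df["B"] = pd.to_datetime(df["B"])
df.B.dt.day.groupby(df.A).diff().groupby(df.A).max()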
Out[177]: 
A
1111    3.0
2222   -1.0
Name: B, dtype: float64
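For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. It assumes B starts out as strings, as in the question, and it reshapes the result into the `A` / `Max_diff` layout the question asks for; the names `day_diff` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["1111", "2222", "1111", "1111", "2222", "1111"],
    "B": ["2001-01-10", "2001-01-02", "2001-02-11",
          "2001-03-14", "2001-02-01", "2001-04-14"],
})

# The .dt accessor only works on datetime columns, so convert the strings first.
df["B"] = pd.to_datetime(df["B"])

# Day-of-month difference between consecutive rows of the same user.
# The first row of each user has no predecessor, so its diff is NaN.
day_diff = df["B"].dt.day.groupby(df["A"]).diff()

# Largest difference per user, as a two-column frame.
result = day_diff.groupby(df["A"]).max().rename("Max_diff").reset_index()
print(result)
```

The value for "2222" comes out negative (2 to 1 gives -1) because only the day-of-month is compared and the month is ignored, which matches the rule described in the question.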

Answer 1: (score: 1)

This finds the maximum difference between consecutive dates within each group.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1,10,size=(6,2)),columns = list("AB"))
df["A"] = ["1111","2222","1111","1111","2222","1111"]
df["B"] = ["2001-01-10","2001-01-02","2001-02-11","2001-03-14","2001-02-01","2001-04-14"]

df["B"] = pd.to_datetime(df["B"])

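def myfunc(x):
    # Uncomment if the dates are not already in order within each group.
    # x = x.sort_values(by=['B'])
    # Gap between each transaction and the previous one in the same group.
    return x["B"].diff().rename("Trans Diff Days")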

new_series = df.groupby("A").apply(myfunc)
print(new_series.groupby("A").max())

The output is

A
1111   32 days
2222   30 days
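Since the question mentions that the loop is slow on a large dataset, the same calendar-day gaps can probably be computed without `apply`. This is only a sketch under the assumption that the full date difference (not just the day-of-month) is what is wanted; `ordered`, `gaps` and `max_gap` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["1111", "2222", "1111", "1111", "2222", "1111"],
    "B": ["2001-01-10", "2001-01-02", "2001-02-11",
          "2001-03-14", "2001-02-01", "2001-04-14"],
})
df["B"] = pd.to_datetime(df["B"])

# Sort so that diff() compares each transaction with the previous one
# of the same user, whatever the original row order was.
ordered = df.sort_values(["A", "B"])

# Built-in groupby diff, no Python-level function called per group.
gaps = ordered.groupby("A")["B"].diff()

# Largest gap per user, converted from timedeltas to whole days.
max_gap = gaps.groupby(ordered["A"]).max()
print(max_gap.dt.days)
```

On the sample data this should give the same 32-day and 30-day gaps as the `apply` version above.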