Table (df):
customer_id Order_date
1 2015-01-16
1 2015-01-19
2 2014-12-21
2 2015-01-10
1 2015-01-10
3 2018-01-18
3 2017-03-04
4 2019-11-05
4 2010-01-01
3 2019-02-03
3 2020-01-01
3 2018-01-01
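Desired output: a subset of df containing the customer_id values that have 3 or more order dates. (Customer IDs 2, 4, and 5 are skipped because they have fewer than 3 order dates.)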
Customer_id Number_of_Order_dates
1 3
3 5
I tried groupby, but I could not get it to produce the subset. Please help.
Code I have tried so far, which failed:
df[df['days'].count()>3]
Another attempt that also failed:
df1=df.groupby('customer_id')['order_date'].count()
df[df1.iloc[:,1]]
Answer 0: (score: 6)
IIUC
df.groupby('customer_id')['Order_date'].nunique().loc[lambda x : x>=3].reset_index()
Out[94]:
customer_id Order_date
0 1 3
1 3 5
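For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the question's table and applies the same `nunique` idea. The column name `Number_of_Order_dates` is taken from the desired output in the question; parsing the dates with `pd.to_datetime` is an assumption, since the question does not say what dtype the column has.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 1, 3, 3, 4, 4, 3, 3, 3],
    "Order_date": pd.to_datetime([
        "2015-01-16", "2015-01-19", "2014-12-21", "2015-01-10",
        "2015-01-10", "2018-01-18", "2017-03-04", "2019-11-05",
        "2010-01-01", "2019-02-03", "2020-01-01", "2018-01-01",
    ]),
})

# Count distinct order dates per customer, then keep customers with at least 3.
counts = (
    df.groupby("customer_id")["Order_date"]
      .nunique()
      .loc[lambda s: s >= 3]
      .rename("Number_of_Order_dates")  # name used in the question's desired output
      .reset_index()
)
print(counts)
```

Note that `nunique` counts distinct dates, so a customer who ordered twice on the same day is counted once for that day. If every order row should count, `count()` or `size()` could be used instead.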
Answer 1: (score: 4)
You can use:
df.groupby('customer_id').filter(lambda x:
(x['Order_date'].nunique()>=3)).groupby('customer_id').count()
Or:
(df[df.groupby('customer_id')['Order_date'].transform('nunique').ge(3)]
.groupby('customer_id').count())
Order_date
customer_id
1 3
3 5
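If the goal is the original rows rather than the per-customer counts, the trailing `.groupby('customer_id').count()` can be left off. A minimal sketch of that variant, assuming the sample data from the question (the names `n_dates` and `subset` are only illustrative):

```python
import pandas as pd

# Sample data copied from the table in the question (dates kept as strings here).
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 1, 3, 3, 4, 4, 3, 3, 3],
    "Order_date": [
        "2015-01-16", "2015-01-19", "2014-12-21", "2015-01-10",
        "2015-01-10", "2018-01-18", "2017-03-04", "2019-11-05",
        "2010-01-01", "2019-02-03", "2020-01-01", "2018-01-01",
    ],
})

# transform broadcasts each customer's distinct-date count back onto its rows,
# so the result lines up with df and can be used as a boolean mask.
n_dates = df.groupby("customer_id")["Order_date"].transform("nunique")
subset = df[n_dates >= 3]
print(subset)
```

This keeps every order row for customers 1 and 3 and preserves the original index, which can be handy if the rows need to be joined back to something else later.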
Answer 2: (score: 2)
Use GroupBy.nunique together with DataFrame.query:
df.groupby('customer_id')['Order_date'].nunique().reset_index().query('Order_date >= 3')
customer_id Order_date
0 1 3
2 3 5
Answer 3: (score: 1)
dict
d = {}
for c, o in zip(*map(df.get, df)):
    d.setdefault(c, set()).add(o)
pd.DataFrame(
    [(c, len(o)) for c, o in d.items() if len(o) >= 3],
    columns=[*df]
)
customer_id Order_date
0 1 3
1 3 5
pd.factorize and np.bincount
i, u = df.drop_duplicates().customer_id.factorize()
c = np.bincount(i)
pd.DataFrame(
    [(u_, c_) for u_, c_ in zip(u, c) if c_ > 2],
    columns=[*df]
)
customer_id Order_date
0 1 3
1 3 5
Answer 4: (score: 0)
A brute-force approach is to add the groupby result as a new column, named something like num_dates, and then restrict the whole df like this:
result = my_df[my_df['num_dates'] > 3]
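The answer does not show how the num_dates column is built. One possible way, sketched here as an assumption rather than as the answerer's exact method, is `groupby(...).transform("count")`, which writes each customer's row count back onto every one of that customer's rows. The sketch filters with `>= 3` instead of `> 3`, because the desired output in the question includes customer 1, who has exactly 3 order dates.

```python
import pandas as pd

# Sample data copied from the table in the question.
my_df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 1, 3, 3, 4, 4, 3, 3, 3],
    "Order_date": [
        "2015-01-16", "2015-01-19", "2014-12-21", "2015-01-10",
        "2015-01-10", "2018-01-18", "2017-03-04", "2019-11-05",
        "2010-01-01", "2019-02-03", "2020-01-01", "2018-01-01",
    ],
})

# Add the per-customer order count as a new column aligned with the original rows.
my_df["num_dates"] = my_df.groupby("customer_id")["Order_date"].transform("count")

# Keep only the rows of customers with at least 3 orders.
result = my_df[my_df["num_dates"] >= 3]
print(result)
```

With the strict `> 3` comparison, only customer 3 would remain for this sample data.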