Suppose I have the following pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],
'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})
DATAFRAME1
date name
0 2015-01-31 07:14:39 Dave
1 2014-12-16 22:50:55 Lisa
2 2015-04-12 23:29:11 John
3 2015-04-08 17:57:29 Lisa
4 2015-01-30 03:51:12 Simon
5 2015-02-20 10:33:48 Simon
6 2014-12-15 23:54:03 Simon
7 2014-12-16 19:53:53 Simon
8 2014-12-18 00:15:02 Lisa
9 2015-04-01 21:36:55 Dave
10 2015-04-13 23:25:55 Dave
11 2015-02-18 14:10:40 John
12 2015-02-27 04:56:33 Lisa
DATAFRAME2
name datemax
0 Dave 2015-04-13 23:25:55
1 John 2015-04-12 23:29:11
2 Lisa 2015-04-08 17:57:29
3 Simon 2015-02-20 10:33:48
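The 'date' and 'datemax' columns hold datetime objects.
I need to group DATAFRAME1 by 'name' and randomly pick one date per group, but the picked date must be earlier than that name's 'datemax' in the second dataframe (DATAFRAME2).
The real dataframes I'm working with are much larger than this example, so I need a fast way to do this.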
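Answer 0 (score: 3)
I would first filter out all the dates that don't meet that criterion (df2a is not defined in this answer; it is presumably DATAFRAME2 indexed by 'name'):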
In [11]: df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[11]:
0 2015-04-13 23:25:55
1 2015-04-08 17:57:29
2 2015-04-12 23:29:11
3 2015-04-08 17:57:29
4 2015-02-20 10:33:48
5 2015-02-20 10:33:48
6 2015-02-20 10:33:48
7 2015-02-20 10:33:48
8 2015-04-08 17:57:29
9 2015-04-13 23:25:55
10 2015-04-13 23:25:55
11 2015-04-12 23:29:11
12 2015-04-08 17:57:29
Name: date, dtype: datetime64[ns]
In [12]: df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[12]:
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 True
9 True
10 False
11 True
12 True
Name: date, dtype: bool
In [13]: df_old = df[df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])]
In [14]: df_old
Out[14]:
date name
0 2015-01-31 07:14:39 Dave
1 2014-12-16 22:50:55 Lisa
4 2015-01-30 03:51:12 Simon
6 2014-12-15 23:54:03 Simon
7 2014-12-16 19:53:53 Simon
8 2014-12-18 00:15:02 Lisa
9 2015-04-01 21:36:55 Dave
11 2015-02-18 14:10:40 John
12 2015-02-27 04:56:33 Lisa
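The filtering step above can also be written without the groupby/transform lambda. The following is a minimal sketch, not the answerer's code: it assumes df2a is simply DATAFRAME2 with 'name' as its index (the answer never shows how df2a was built), and it uses Series.map to look up each row's cutoff.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33",
    ]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48",
    ]),
})

# Assumption: df2a is DATAFRAME2 indexed by name (not shown in the answer).
df2a = df2.set_index("name")

# Look up the cutoff for every row by its name, aligned with df's index.
cutoff = df["name"].map(df2a["datemax"])

# Keep only the rows strictly earlier than their name's datemax.
df_old = df[df["date"] < cutoff]
print(df_old)
```

On the sample data this keeps the same rows as Out[14]. The map lookup is vectorized, so it avoids calling a Python function once per group.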
Now it becomes an easier problem: pick a random row by name:
df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
In [21]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
Out[21]:
date
name
Dave 2015-04-01 21:36:55
John 2015-02-18 14:10:40
Lisa 2014-12-16 22:50:55
Simon 2014-12-15 23:54:03
In [22]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
Out[22]:
date
name
Dave 2015-01-31 07:14:39
John 2015-02-18 14:10:40
Lisa 2014-12-18 00:15:02
Simon 2014-12-16 19:53:53
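For the random pick itself, newer pandas versions (1.1 and later, as far as I know) offer a sample method directly on the groupby object. Here is a small sketch of that alternative, assuming a df_old like the one in Out[14]; the random_state value is an arbitrary choice to make the draw repeatable.

```python
import pandas as pd

# Rebuild the filtered frame from Out[14] so the example is self-contained.
df_old = pd.DataFrame(
    {
        "date": pd.to_datetime([
            "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-01-30 03:51:12",
            "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
            "2015-04-01 21:36:55", "2015-02-18 14:10:40", "2015-02-27 04:56:33",
        ]),
        "name": ["Dave", "Lisa", "Simon", "Simon", "Simon",
                 "Lisa", "Dave", "John", "Lisa"],
    },
    index=[0, 1, 4, 6, 7, 8, 9, 11, 12],
)

# Draw one row per name; random_state (arbitrary here) makes the draw repeatable.
picked = df_old.groupby("name").sample(n=1, random_state=42)
print(picked)
```

Unlike the agg version, the result keeps the original row labels, which is handy if you need to trace a pick back to DATAFRAME1.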
Answer 1 (score: 1)
Here is my suggestion:
import random
import pandas as pd
df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})
df['date'] = pd.to_datetime(df['date'])
df2 = pd.DataFrame([['Dave','2015-04-13 23:25:55'],['John','2015-04-12 23:29:11'],['Lisa','2015-04-08 17:57:29'],['Simon','2015-02-20 10:33:48']])
df2.columns = ['name','datemax']
df2['datemax'] = pd.to_datetime(df2['datemax'])
df = df.merge(df2,how='left')
grouped = df.groupby('name')
grouped.apply(lambda x: random.choice([a for a in x['date'].values if a<x['datemax'].values[0]]))
It took 18 ms, and I'd guess it scales linearly.
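Since the question asks for speed on larger frames, a variant of the merge approach that avoids a Python-level function per group may be worth trying. This is only a sketch, and I have not timed it against the code above: it shuffles the eligible rows once and keeps the first row seen for each name, which gives each eligible row in a group an equal chance of being picked. The random_state value is an arbitrary assumption for repeatability.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33",
    ]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48",
    ]),
})

# Attach each name's cutoff, then keep rows strictly before it.
merged = df.merge(df2, on="name", how="left")
eligible = merged[merged["date"] < merged["datemax"]]

# Shuffle all eligible rows once, then keep the first row per name.
picked = eligible.sample(frac=1, random_state=0).drop_duplicates("name")
print(picked[["name", "date"]])
```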
Answer 2 (score: 0)
You can use pd.DataFrame.sample, something like:
In [697]: idx = df2.set_index('name').datemax
In [698]: (df1.groupby('name')
.apply(lambda x: x.loc[x.date < idx[x.name]].sample(1))
.reset_index(drop=True))
Out[698]:
date name
0 2015-04-01 21:36:55 Dave
1 2015-02-18 14:10:40 John
2 2014-12-18 00:15:02 Lisa
3 2014-12-16 19:53:53 Simon
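One edge case to keep in mind with the approach above: if some name has no date earlier than its datemax, the filtered group is empty and, as far as I can tell, sample(1) raises on it. A hedged sketch of one way around that is to filter before grouping and then reindex, so such names show up as NaT. The name 'Eve' and her dates are hypothetical and only there to trigger the case; the Dave rows come from the question's data.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Dave", "Dave", "Eve"],  # 'Eve' is a hypothetical extra name
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2015-04-13 23:25:55", "2015-03-01 00:00:00",
    ]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "Eve"],
    "datemax": pd.to_datetime(["2015-04-13 23:25:55", "2015-03-01 00:00:00"]),
})

idx = df2.set_index("name")["datemax"]

# Filter first, so names without eligible rows never form an empty group.
eligible = df1[df1["date"] < df1["name"].map(idx)]

# One random eligible date per name; names with none come back as NaT.
result = (
    eligible.groupby("name").sample(n=1, random_state=0)
    .set_index("name")["date"]
    .reindex(idx.index)
)
print(result)
```

Here Eve's only date equals her datemax, so she has no eligible row and ends up as NaT instead of breaking the whole computation.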