I have the following sample data:
IND ID value EFFECTIVE DT SYSDATE
8 A 19289 6/30/2017 8/16/2017 10:05
17 A 19289 6/30/2017 8/15/2017 14:25
26 A 19289 6/30/2017 8/14/2017 15:10
7 A 18155 3/31/2017 8/16/2017 10:05
16 A 18155 3/31/2017 8/15/2017 14:25
25 A 18155 3/31/2017 8/14/2017 15:10
6 A 21770 12/31/2016 8/16/2017 10:05
15 A 21770 12/31/2016 8/15/2017 14:25
24 A 21770 12/31/2016 8/14/2017 15:10
5 A 19226 9/30/2016 8/16/2017 10:05
14 A 19226 9/30/2016 8/15/2017 14:25
23 A 19226 9/30/2016 8/14/2017 15:10
4 A 20238 6/30/2016 8/16/2017 10:05
13 A 20238 6/30/2016 8/15/2017 14:25
22 A 20238 6/30/2016 8/14/2017 15:10
3 A 18684 3/31/2016 8/16/2017 10:05
12 A 18684 3/31/2016 8/15/2017 14:25
21 A 18684 3/31/2016 8/14/2017 15:10
2 A 22059 12/31/2015 8/16/2017 10:05
11 A 22059 12/31/2015 8/15/2017 14:25
20 A 22059 12/31/2015 8/14/2017 15:10
1 A 19280 9/30/2015 8/16/2017 10:05
10 A 19280 9/30/2015 8/15/2017 14:25
19 A 19280 9/30/2015 8/14/2017 15:10
0 A 20813 6/30/2015 8/16/2017 10:05
9 A 20813 6/30/2015 8/15/2017 14:25
18 A 20813 6/30/2015 8/14/2017 15:10
This is a set of data I collect every business day (SYSDATE is the timestamp).
I want to produce a df that keeps only the rows stamped with either of the two most recent timestamp values for each day.
So if I were to run the script today, this is what I'd want back (out of the larger set of timestamps):
IND ID Value EFFECTIVE DT SYSDATE
8 A 19289 6/30/2017 8/16/2017 10:05
17 A 19289 6/30/2017 8/15/2017 14:25
7 A 18155 3/31/2017 8/16/2017 10:05
16 A 18155 3/31/2017 8/15/2017 14:25
6 A 21770 12/31/2016 8/16/2017 10:05
15 A 21770 12/31/2016 8/15/2017 14:25
5 A 19226 9/30/2016 8/16/2017 10:05
14 A 19226 9/30/2016 8/15/2017 14:25
4 A 20238 6/30/2016 8/16/2017 10:05
13 A 20238 6/30/2016 8/15/2017 14:25
3 A 18684 3/31/2016 8/16/2017 10:05
12 A 18684 3/31/2016 8/15/2017 14:25
2 A 22059 12/31/2015 8/16/2017 10:05
11 A 22059 12/31/2015 8/15/2017 14:25
1 A 19280 9/30/2015 8/16/2017 10:05
10 A 19280 9/30/2015 8/15/2017 14:25
0 A 20813 6/30/2015 8/16/2017 10:05
9 A 20813 6/30/2015 8/15/2017 14:25
Because of weekends and holidays, I can't simply filter on dates.
Suggestions?
Thanks in advance.
Answer 0 (score: 4)
You need to first make sure SYSDATE is converted to datetime. I'd do the same for EFFECTIVE DT.
df[['EFFECTIVE DT', 'SYSDATE']] = \
df[['EFFECTIVE DT', 'SYSDATE']].apply(pd.to_datetime)
Option 1
pir1
Use groupby.apply with the dataframe method pd.DataFrame.nlargest, passing the arguments n=2 and columns='SYSDATE' for the two largest values of 'SYSDATE'.
df.groupby(
'EFFECTIVE DT', group_keys=False, sort=False
).apply(pd.DataFrame.nlargest, n=2, columns='SYSDATE')
IND ID value EFFECTIVE DT SYSDATE
0 8 A 19289 2017-06-30 2017-08-16 10:05:00
1 17 A 19289 2017-06-30 2017-08-15 14:25:00
3 7 A 18155 2017-03-31 2017-08-16 10:05:00
4 16 A 18155 2017-03-31 2017-08-15 14:25:00
6 6 A 21770 2016-12-31 2017-08-16 10:05:00
7 15 A 21770 2016-12-31 2017-08-15 14:25:00
9 5 A 19226 2016-09-30 2017-08-16 10:05:00
10 14 A 19226 2016-09-30 2017-08-15 14:25:00
12 4 A 20238 2016-06-30 2017-08-16 10:05:00
13 13 A 20238 2016-06-30 2017-08-15 14:25:00
15 3 A 18684 2016-03-31 2017-08-16 10:05:00
16 12 A 18684 2016-03-31 2017-08-15 14:25:00
18 2 A 22059 2015-12-31 2017-08-16 10:05:00
19 11 A 22059 2015-12-31 2017-08-15 14:25:00
21 1 A 19280 2015-09-30 2017-08-16 10:05:00
22 10 A 19280 2015-09-30 2017-08-15 14:25:00
24 0 A 20813 2015-06-30 2017-08-16 10:05:00
25 9 A 20813 2015-06-30 2017-08-15 14:25:00
How it works
To begin with: splitting your work up, applying something to the splits, and recombining the results is a prominent feature of pandas and is explained under split-apply-combine.
The groupby element should be self-evident. I want to group the data by each day as defined by the dates in the 'EFFECTIVE DT' column. Afterwards, you can do many things with this groupby object. I chose to apply a function that returns the 2 rows corresponding to the largest two values of the 'SYSDATE' column. Those largest values equate to the most recent values for that group's day.
It turns out there is a dataframe method that performs exactly this task of returning the rows corresponding to a column's largest values, namely pd.DataFrame.nlargest.
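As a quick standalone illustration of pd.DataFrame.nlargest (a toy frame with made-up timestamps, not the question's data):

```python
import pandas as pd

# A toy frame; the timestamps are illustrative only
toy = pd.DataFrame({'ts': pd.to_datetime(
    ['2017-08-14 15:10', '2017-08-15 14:25', '2017-08-16 10:05'])})

# nlargest returns the rows holding the n largest values of the given
# column, sorted by that column in descending order
top2 = toy.nlargest(2, columns='ts')
print(top2)
```

The two returned rows keep their original index labels, which is why the grouped result above preserves the IND row order of the input.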
There are two things to note:
When using groupby.apply, the object passed to the function being applied is a pd.DataFrame object.
A method like pd.DataFrame.nlargest, used as a function, expects its first argument to be a pd.DataFrame object. Well, that's fortunate, because that's exactly what gets passed.
Furthermore, groupby.apply lets you forward additional keyword arguments to the applied function via kwargs, so I can easily pass n=2 and columns='SYSDATE'.
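A minimal sketch of that keyword-forwarding behaviour (toy column names g and x, chosen just for this example):

```python
import pandas as pd

# Toy frame; 'g' is the grouping key, 'x' the value column
df_demo = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 3, 2]})

# Extra keyword arguments given to .apply are forwarded to the applied
# function, so pd.DataFrame.nlargest receives n=1 and columns='x'
# for every group
out = df_demo.groupby('g', group_keys=False).apply(
    pd.DataFrame.nlargest, n=1, columns='x')
print(out)
```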
Option 2
pir2
Same concept as option 1, but using np.argpartition:
def nlrg(d):
    v = d.HOURS.values
    a = np.argpartition(v, v.size - 2)[-2:]
    return d.iloc[a]

pir2 = lambda d: d.groupby('DAYS', sort=False, group_keys=False).apply(nlrg)
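The argpartition trick nlrg relies on can be sketched on a bare array (toy values only):

```python
import numpy as np

v = np.array([5., 1., 9., 3., 7.])

# argpartition moves the (size-2)-th smallest element into its sorted
# slot; everything after that slot is >= it, so the last two positions
# index the two largest values (unordered between themselves)
idx = np.argpartition(v, v.size - 2)[-2:]
print(sorted(v[idx]))
```

This is O(n) per group, versus the full sort behind nlargest, which is where option 2's speedup comes from.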
Option 3
pir4
Using numba.njit, I scan the array once while tracking the two largest values seen so far per group.
from numba import njit

@njit
def nlrg_nb(f, v, i, n):
    b = (np.arange(n * 2) * 0).reshape(-1, 2)
    e = b * np.nan
    for x, y, z in zip(f, v, i):
        if np.isnan(e[x, 0]):
            e[x, 0] = y
            b[x, 0] = z
        elif y > e[x, 0]:
            e[x, :] = [y, e[x, 0]]
            b[x, :] = [z, b[x, 0]]
        elif np.isnan(e[x, 1]):
            e[x, 1] = y
            b[x, 1] = z
        elif y > e[x, 1]:
            e[x, 1] = y
            b[x, 1] = z
    return b.ravel()[~np.isnan(e.ravel())]

def pir4(d):
    f, u = pd.factorize(d.DAYS.values)
    return d.iloc[nlrg_nb(f, d.HOURS.values.astype(float), np.arange(f.size), u.size)]
Timings
Results
(lambda r: r.div(r.min(1), 0))(results)
pir1 pir2 pir4 jez1
100 24.205348 9.725718 1.0 4.449165
300 42.685989 15.754161 1.0 4.047182
1000 111.733703 39.822652 1.0 4.175235
3000 253.873888 74.280675 1.0 4.105493
10000 376.157526 125.323946 1.0 4.313063
30000 434.815009 145.513904 1.0 5.296250
Simulation
from timeit import timeit

def produce_test_df(i):
    hours = pd.date_range('2000-01-01', periods=i, freq='H')[np.random.permutation(np.arange(i))]
    days = hours.floor('D')
    return pd.DataFrame(dict(HOURS=hours, DAYS=days))

results = pd.DataFrame(
    index=[100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir4 jez1'.split(),
    dtype=float,
)

for i in results.index:
    d = produce_test_df(i)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.loc[i, j] = timeit(stmt, setp, number=20)

results.plot(loglog=True)
Functions
def nlrg(d):
    v = d.HOURS.values
    a = np.argpartition(v, v.size - 2)[-2:]
    return d.iloc[a]

pir1 = lambda d: d.groupby('DAYS', group_keys=False, sort=False).apply(pd.DataFrame.nlargest, n=2, columns='HOURS')
pir2 = lambda d: d.groupby('DAYS', sort=False, group_keys=False).apply(nlrg)
jez1 = lambda d: d.sort_values(['DAYS', 'HOURS']).groupby('DAYS').tail(2)

@njit
def nlrg_nb(f, v, i, n):
    b = (np.arange(n * 2) * 0).reshape(-1, 2)
    e = b * np.nan
    for x, y, z in zip(f, v, i):
        if np.isnan(e[x, 0]):
            e[x, 0] = y
            b[x, 0] = z
        elif y > e[x, 0]:
            e[x, :] = [y, e[x, 0]]
            b[x, :] = [z, b[x, 0]]
        elif np.isnan(e[x, 1]):
            e[x, 1] = y
            b[x, 1] = z
        elif y > e[x, 1]:
            e[x, 1] = y
            b[x, 1] = z
    return b.ravel()[~np.isnan(e.ravel())]

def pir4(d):
    f, u = pd.factorize(d.DAYS.values)
    return d.iloc[nlrg_nb(f, d.HOURS.values.astype(float), np.arange(f.size), u.size)]
Answer 1 (score: 3)
I think you need sort_values, then groupby and GroupBy.head:
#convert to datetime
df['SYSDATE'] = pd.to_datetime(df['SYSDATE'])
df['EFFECTIVE DT'] = pd.to_datetime(df['EFFECTIVE DT'])
df = df.sort_values(['EFFECTIVE DT','SYSDATE'], ascending=[True,False])
df = df.groupby('EFFECTIVE DT').head(2).sort_index()
print (df)
IND ID value EFFECTIVE DT SYSDATE
0 8 A 19289 2017-06-30 2017-08-16 10:05:00
1 17 A 19289 2017-06-30 2017-08-15 14:25:00
3 7 A 18155 2017-03-31 2017-08-16 10:05:00
4 16 A 18155 2017-03-31 2017-08-15 14:25:00
6 6 A 21770 2016-12-31 2017-08-16 10:05:00
7 15 A 21770 2016-12-31 2017-08-15 14:25:00
9 5 A 19226 2016-09-30 2017-08-16 10:05:00
10 14 A 19226 2016-09-30 2017-08-15 14:25:00
12 4 A 20238 2016-06-30 2017-08-16 10:05:00
13 13 A 20238 2016-06-30 2017-08-15 14:25:00
15 3 A 18684 2016-03-31 2017-08-16 10:05:00
16 12 A 18684 2016-03-31 2017-08-15 14:25:00
18 2 A 22059 2015-12-31 2017-08-16 10:05:00
19 11 A 22059 2015-12-31 2017-08-15 14:25:00
21 1 A 19280 2015-09-30 2017-08-16 10:05:00
22 10 A 19280 2015-09-30 2017-08-15 14:25:00
24 0 A 20813 2015-06-30 2017-08-16 10:05:00
25 9 A 20813 2015-06-30 2017-08-15 14:25:00
Another similar solution, thanks piRSquared:
df = df.sort_values(['EFFECTIVE DT','SYSDATE']) \
       .groupby('EFFECTIVE DT') \
       .tail(2) \
       .sort_index()
print (df)
IND ID value EFFECTIVE DT SYSDATE
0 8 A 19289 2017-06-30 2017-08-16 10:05:00
1 17 A 19289 2017-06-30 2017-08-15 14:25:00
3 7 A 18155 2017-03-31 2017-08-16 10:05:00
4 16 A 18155 2017-03-31 2017-08-15 14:25:00
6 6 A 21770 2016-12-31 2017-08-16 10:05:00
7 15 A 21770 2016-12-31 2017-08-15 14:25:00
9 5 A 19226 2016-09-30 2017-08-16 10:05:00
10 14 A 19226 2016-09-30 2017-08-15 14:25:00
12 4 A 20238 2016-06-30 2017-08-16 10:05:00
13 13 A 20238 2016-06-30 2017-08-15 14:25:00
15 3 A 18684 2016-03-31 2017-08-16 10:05:00
16 12 A 18684 2016-03-31 2017-08-15 14:25:00
18 2 A 22059 2015-12-31 2017-08-16 10:05:00
19 11 A 22059 2015-12-31 2017-08-15 14:25:00
21 1 A 19280 2015-09-30 2017-08-16 10:05:00
22 10 A 19280 2015-09-30 2017-08-15 14:25:00
24 0 A 20813 2015-06-30 2017-08-16 10:05:00
25 9 A 20813 2015-06-30 2017-08-15 14:25:00
Timings:
np.random.seed(345)
N = 100000
dates1 = [pd.Timestamp('2017-08-16 10:05:00'),
          pd.Timestamp('2017-08-15 14:25:00'),
          pd.Timestamp('2017-08-14 15:10:00')] * 8
dates2 = pd.date_range('2015-01-01', periods=2000)
df = pd.DataFrame({'EFFECTIVE DT':np.random.choice(dates2, N),
'SYSDATE':np.random.choice(dates1, N)})
In [104]: %timeit df.groupby('EFFECTIVE DT', group_keys=False, sort=False).apply(pd.DataFrame.nlargest, n=2, columns='SYSDATE')
1 loop, best of 3: 2.61 s per loop
In [105]: %timeit df.sort_values(['EFFECTIVE DT','SYSDATE'], ascending=[True,False]).groupby('EFFECTIVE DT').head(2).sort_index()
10 loops, best of 3: 25.4 ms per loop
In [106]: %timeit df.sort_values(['EFFECTIVE DT','SYSDATE']).groupby('EFFECTIVE DT').tail(2).sort_index()
10 loops, best of 3: 23.3 ms per loop