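I want to compute a rolling_max of a pandas column where the window size varies: it is the difference between the current row's index and the index of the most recent row that meets some condition.
So, as an example, I have: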
df = pd.DataFrame({'a': [0,1,0,0,0,1,0,0,0,0,1,0],
'b': [5,4,3,6,1,2,3,4,2,1,7,8]})
I want a rolling_max of df.b since the last time df.a == 1. That is, I want to get this:
a b rm
0 0 5 NaN <- no previous a==1
1 1 4 4 <- a==1
2 0 3 4
3 0 6 6
4 0 1 6
5 1 2 2 <- a==1
6 0 3 3
7 0 4 4
8 0 2 4
9 0 1 4
10 1 7 7 <- a==1
11 0 8 8
My df has an integer index with no gaps, so I tried this:
df['last_a'] = np.where(df.a == 1, df.index, np.nan)
df['last_a'].fillna(method='ffill', inplace=True)
df['rm'] = pd.rolling_max(df['b'], window = df.index - df['last_a'] + 1)
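But I get a TypeError: an integer is required.
This is part of a long script that runs on fairly large dataframes, so I need the fastest solution possible. I did manage to do this with a loop instead of rolling_max, but it is very slow. Can you help?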
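For reference, a plain row-by-row loop of the kind described above might look like the sketch below. The helper name running_max_since_flag and the column name rm_loop are hypothetical, chosen only for illustration; this is not the asker's actual loop. It is slow on large frames, but it is handy as a correctness baseline for faster approaches.

```python
import numpy as np
import pandas as pd


def running_max_since_flag(flags, values):
    """Hypothetical helper: running max of `values`, restarted wherever `flags` == 1.

    Rows before the first flag get NaN.
    """
    out = np.full(len(values), np.nan)
    current = np.nan
    seen_flag = False
    for i, (flag, value) in enumerate(zip(flags, values)):
        if flag == 1:
            # A flagged row restarts the running maximum.
            current = value
            seen_flag = True
        elif seen_flag:
            current = max(current, value)
        if seen_flag:
            out[i] = current
    return out


df = pd.DataFrame({'a': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
                   'b': [5, 4, 3, 6, 1, 2, 3, 4, 2, 1, 7, 8]})
df['rm_loop'] = running_max_since_flag(df['a'].to_numpy(), df['b'].to_numpy())
print(df)
```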
FYI: the long, ugly loop I have now, ugly as it is, seems fairly fast on my dataframe (tested on 50,000 x 25). It looks like this:
df['rm2'] = df.b
# Keep a value where a == 1 or where it rose versus the previous row; otherwise carry the last kept value forward
df['rm1'] = np.where((df['a'] == 1) | (df['rm2'].diff() > 0), df['rm2'], np.nan)
df['rm1'] = df['rm1'].ffill()
df['Dif'] = (df['rm1'] - df['rm2']).abs()
# Repeat the pass until it no longer changes anything
while df['Dif'].sum() != 0:
    df['rm2'] = df['rm1']
    df['rm1'] = np.where((df['a'] == 1) | (df['rm2'].diff() > 0), df['rm2'], np.nan)
    df['rm1'] = df['rm1'].ffill()
    df['Dif'] = (df['rm1'] - df['rm2']).abs()
Answer 0 (score: 2)
I would create a group index and use groupby on that index to apply cummax:
import numpy as np
df['index'] = df['a'].cumsum()
df['rm'] = df.groupby('index')['b'].cummax()
df.loc[df['index']==0, 'rm'] = np.nan
In [104]: df
Out[104]:
a b index rm
0 0 5 0 NaN
1 1 4 1 4
2 0 3 1 4
3 0 6 1 6
4 0 1 1 6
5 1 2 2 2
6 0 3 2 3
7 0 4 2 4
8 0 2 2 4
9 0 1 2 4
10 1 7 3 7
11 0 8 3 8
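The same idea can be sketched as a self-contained script that skips the helper column. The names group_id and running_max are illustrative assumptions, and Series.where is used here in place of the .loc assignment to blank out the rows before the first a == 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
                   'b': [5, 4, 3, 6, 1, 2, 3, 4, 2, 1, 7, 8]})

# Each a == 1 row starts a new group; rows before the first one fall in group 0.
group_id = df['a'].eq(1).cumsum()

# Cumulative max of b within each group.
running_max = df['b'].groupby(group_id).cummax()

# Group 0 has no preceding a == 1, so mask it out as NaN.
df['rm'] = running_max.where(group_id > 0)
print(df)
```

Because the grouping key is computed once with cumsum and cummax runs per group in vectorized code, this avoids any Python-level loop over rows.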
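Answer 1 (score: 0)
In fact, whenever you need to restructure data in ways that involve relationships between columns and tables, consider a SQL solution with a relational database management system (RDBMS), especially if your data already comes from a database. Leave the data analysis to pandas. Of course, if your data is not stored in a database, that is another matter!
Python ships with a built-in library for SQLite, a popular free, open-source, file-level database. Python libraries are also available for MySQL, SQL Server, PostgreSQL, Oracle, and other RDBMSs, and each connection can be integrated seamlessly with pandas. Below are three equivalent versions of a query that computes the conditional group maximum. Each assumes that the source table, named RollingMax here, has an autonumber primary key column ID.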
import sqlite3 as lite
import pandas as pd
con = lite.connect('C:\\Path\\SQLite\\DB.db')
# SQL WITH DERIVED TABLES
sql = """SELECT a, b,
(SELECT Max(dtbl2.B)
FROM
(SELECT t1.ID, t1.a, t1.b,
(SELECT Count(*) FROM RollingMax t2
WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
FROM RollingMax t1) dtbl2
WHERE dtbl1.ID >= dtbl2.ID
AND dtbl1.GrpA = dtbl2.GrpA) As rm
FROM
(
SELECT t1.ID, t1.a, t1.b,
(SELECT Count(*) FROM RollingMax t2
WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
FROM RollingMax t1
) As dtbl1;"""
# SQL USING A CTE (WITH CLAUSE, AVAILABLE AS OF SQLITE VERSION 3.8.3)
sql = """WITH grp (ID, a, b, GrpA)
AS (
SELECT t1.ID, t1.a, t1.b,
(SELECT Count(*) FROM RollingMax t2
WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
FROM RollingMax t1
)
SELECT a, b,
(SELECT Max(dtbl2.B)
FROM grp AS dtbl2
WHERE dtbl1.ID >= dtbl2.ID
AND dtbl1.GrpA = dtbl2.GrpA) As rm
FROM grp AS dtbl1;"""
# SQL USING SAVED VIEW
# The SELECT below must first be saved inside the database as a view named saved_view
saved_view = """SELECT t1.ID, t1.a, t1.b,
(SELECT Count(*) FROM RollingMax t2
WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
FROM RollingMax t1;"""
sql = """SELECT a, b,
(SELECT Max(dtbl2.B)
FROM saved_view AS dtbl2
WHERE dtbl1.ID >= dtbl2.ID
AND dtbl1.GrpA = dtbl2.GrpA) As rm
FROM saved_view As dtbl1;"""
df = pd.read_sql(sql, con)
Output (the only challenge here is the first group, which has no preceding a == 1)
a b rm
0 5 5
1 4 4
0 3 4
0 6 6
0 1 6
1 2 2
0 3 3
0 4 4
0 2 4
0 1 4
1 7 7
0 8 8
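One possible way to handle that first group is sketched below, using an in-memory SQLite database so the script runs on its own. The CASE expression that returns NULL when GrpA = 0 is an addition for illustration and is not part of the queries above; the table layout (ID, a, b) follows the assumption stated in this answer.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
                   'b': [5, 4, 3, 6, 1, 2, 3, 4, 2, 1, 7, 8]})

# In-memory database standing in for the real source table.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE RollingMax (ID INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
rows = [(i + 1, a, b)
        for i, (a, b) in enumerate(zip(df['a'].tolist(), df['b'].tolist()))]
con.executemany("INSERT INTO RollingMax (ID, a, b) VALUES (?, ?, ?)", rows)

sql = """
WITH grp (ID, a, b, GrpA) AS (
    SELECT t1.ID, t1.a, t1.b,
           (SELECT Count(*) FROM RollingMax t2
            WHERE t1.ID >= t2.ID AND t2.a > 0) AS GrpA
    FROM RollingMax t1
)
SELECT d1.a, d1.b,
       CASE WHEN d1.GrpA = 0 THEN NULL  -- no preceding a == 1
            ELSE (SELECT Max(d2.b) FROM grp AS d2
                  WHERE d1.ID >= d2.ID AND d1.GrpA = d2.GrpA)
       END AS rm
FROM grp AS d1
ORDER BY d1.ID;
"""
result = pd.read_sql(sql, con)
con.close()
print(result)
```

The correlated subqueries rescan the table for every row, so on large tables this is likely to be much slower than a vectorized pandas approach.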