Suppose I have a Pandas DataFrame df:
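    Date  Value
01/01/17      0
01/02/17      0
01/03/17      1
01/04/17      0
01/05/17      0
01/06/17      0
01/07/17      1
01/08/17      0
01/09/17      0

For each row, I want to efficiently calculate the number of days since the last occurrence of Value=1, so that df becomes:

    Date  Value  Last_Occurence
01/01/17      0             NaN
01/02/17      0             NaN
01/03/17      1               0
01/04/17      0               1
01/05/17      0               2
01/06/17      0               3
01/07/17      1               0
01/08/17      0               1
01/09/17      0               2

I could do this with a loop, but that seems very inefficient for extremely large data sets, and it is probably not correct anyway.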
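To make the answers below easy to try, here is a minimal sketch that rebuilds the example frame from the question. The variable name expected is an assumption added for checking results; it is not part of the original post.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame shown in the question.
df = pd.DataFrame({
    "Date": ["01/01/17", "01/02/17", "01/03/17", "01/04/17", "01/05/17",
             "01/06/17", "01/07/17", "01/08/17", "01/09/17"],
    "Value": [0, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Desired column from the question; NaN marks rows before the first 1.
# "expected" is a hypothetical helper name used only for comparison.
expected = pd.Series([np.nan, np.nan, 0, 1, 2, 3, 0, 1, 2],
                     name="Last_Occurence")
```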
Answer 0 (score: 6)
Here is a NumPy approach:
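import numpy as np

def intervaled_cumsum(a, trigger_val=1, start_val=0, invalid_specifier=-1):
    # Start with all ones so that a plain cumsum counts up by one per element
    out = np.ones(a.size, dtype=int)
    # Positions where the counter has to restart
    idx = np.flatnonzero(a == trigger_val)
    if len(idx) == 0:
        # No trigger at all: every position is invalid
        return np.full(a.size, invalid_specifier)
    else:
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        # At each later trigger, subtract the length of the previous interval
        # so that the running sum drops back to start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        # Positions before the first trigger have no previous occurrence
        out[:idx[0]] = invalid_specifier
        return out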
A few sample runs on array data, showing the usage across various combinations of trigger and start values:
In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])

In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
     ...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
     ...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
     ...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
     ...:

In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]:
array([[ 0,  1,  1,  1,  0,  0,  1,  0,  0,  1,  1,  1,  1,  1,  0],
       [-1,  0,  0,  0,  1,  2,  0,  1,  2,  0,  0,  0,  0,  0,  1],
       [-1,  1,  1,  1,  2,  3,  1,  2,  3,  1,  1,  1,  1,  1,  2],
       [ 0,  1,  2,  3,  0,  0,  1,  0,  0,  1,  2,  3,  4,  5,  0],
       [ 1,  2,  3,  4,  1,  1,  2,  1,  1,  2,  3,  4,  5,  6,  1]])
Using it to solve our case:
df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Sample output:
In [181]: df
Out[181]:
       Date  Value  Last_Occurence
0  01/01/17      0              -1
1  01/02/17      0              -1
2  01/03/17      1               0
3  01/04/17      0               1
4  01/05/17      0               2
5  01/06/17      0               3
6  01/07/17      1               0
7  01/08/17      0               1
8  01/09/17      0               2
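The question asks for NaN rather than -1 on the rows before the first 1. Since the function builds an integer array, NaN cannot be stored in it directly, so one option is to mask the sentinel afterwards. The sketch below assumes the default invalid_specifier of -1 and starts from the counts shown in the sample output; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Counts as returned for the example data (copied from the sample output).
counts = np.array([-1, -1, 0, 1, 2, 3, 0, 1, 2])

# Keep non-negative counts and turn the -1 sentinel into NaN.
# The result is a float column, because NaN has no integer representation.
last_occurrence = pd.Series(counts).where(counts >= 0)
```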
Runtime test
Approaches:
# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0,False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).\
                     cumsum()).cumcount().where(mask))

# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Timings:
In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=[['Value']])
In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop
In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop
In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=[['Value']])
In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop
In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop
Answer 1 (score: 2)
Let's try this using cumsum, cumcount, and groupby:
mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)
Output:
       Date  Value  Last_Occurance
0  01/01/17      0             NaN
1  01/02/17      0             NaN
2  01/03/17      1             0.0
3  01/04/17      0             1.0
4  01/05/17      0             2.0
5  01/06/17      0             3.0
6  01/07/17      1             0.0
7  01/08/17      0             1.0
8  01/09/17      0             2.0
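To see why this works, it can help to look at the intermediate values one step at a time. The sketch below splits the one-liner into named steps on the example Value column. The names group_id, rows_since and result are illustrative, and the mask group_id > 0 is a simplified stand-in for the replace(0, False).astype(bool) mask used above.

```python
import pandas as pd

df = pd.DataFrame({"Value": [0, 0, 1, 0, 0, 0, 1, 0, 0]})

# Each 1 starts a new group: the running count of 1s labels the rows.
group_id = df["Value"].astype(bool).cumsum()      # 0 0 1 1 1 1 2 2 2

# Position of a row inside its group = rows since the group started.
rows_since = df.groupby(group_id).cumcount()      # 0 1 0 1 2 3 0 1 2

# Group 0 precedes the first 1, so those rows have no previous occurrence.
result = rows_since.where(group_id > 0)           # NaN NaN 0 1 2 3 0 1 2
```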
Answer 2 (score: 1)
You don't have to update the value of last at every step of the for loop. Initialize the variable outside the loop:
last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last
and update it only when a 1 occurs in the Value column.
Note that whichever method you choose, iterating over the whole table once is unavoidable.
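If the per-row df.loc writes turn out to be the slow part, the same single pass can be run over a plain sequence and the column assigned once at the end. This is only a sketch of that idea; the function name rows_since_last_one is hypothetical and not from the original answer.

```python
import numpy as np

def rows_since_last_one(values):
    """Single-pass count of rows since the most recent 1 (NaN before the first)."""
    out = np.full(len(values), np.nan)
    last = None                      # index of the most recent 1, if any
    for i, v in enumerate(values):
        if v == 1:
            last = i                 # updated only when a 1 occurs
        if last is not None:
            out[i] = i - last
    return out

result = rows_since_last_one([0, 0, 1, 0, 0, 0, 1, 0, 0])
```

With a DataFrame, the result could then be assigned in one step, for example df['Last_Occurence'] = rows_since_last_one(df['Value'].to_numpy()).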
Answer 3 (score: 1)
You can use argmax:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()),axis=1)
Out[85]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 2
dtype: int64
If the first two rows must be NaN, use:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()) \
if 1 in df.iloc[x.name::-1].Value.values \
else np.nan,axis=1)
Out[86]:
0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
5 3.0
6 0.0
7 1.0
8 2.0
dtype: float64
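The approaches in this thread count rows, which equals days only when there is exactly one row per day. If the dates can have gaps, a different technique is to carry the date of the last 1 forward and subtract. The sketch below assumes the Date strings are in month/day/year format; the name last_hit is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    # Assumption: dates are month/day/two-digit year, as in 01/09/17.
    "Date": pd.to_datetime(
        ["01/01/17", "01/02/17", "01/03/17", "01/04/17", "01/05/17",
         "01/06/17", "01/07/17", "01/08/17", "01/09/17"],
        format="%m/%d/%y"),
    "Value": [0, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Keep the date only on rows where Value == 1, then carry it forward.
last_hit = df["Date"].where(df["Value"] == 1).ffill()

# Calendar days elapsed since that date; NaN before the first 1.
df["Last_Occurence"] = (df["Date"] - last_hit).dt.days
```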