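I have a DataFrame that I am processing row by row. I currently use iterrows(), which I know is slow, and I would rather use apply(). However, I am not sure how to apply it here (if that is even possible).
The 'edges' data: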
time raw_signal amp_change edge edge_dir
2.73105 499.878 -22.583 TRUE decr
2.7311 477.295 -24.414 TRUE decr
2.73115 452.881 -25.025 TRUE decr
2.7312 427.856 -21.362 TRUE decr
2.7315 412.598 28.076 TRUE incr
2.73155 440.674 25.024 TRUE incr
8.5267 490.112 -24.414 TRUE decr
8.52675 465.698 -30.517 TRUE decr
8.5268 435.181 -25.635 TRUE decr
8.70805 413.208 21.362 TRUE incr
8.7081 434.57 24.414 TRUE incr
10.7113 487.671 -20.752 TRUE decr
10.71135 466.919 -34.79 TRUE decr
10.7114 432.129 -37.842 TRUE decr
10.71145 394.287 -24.414 TRUE decr
10.9586 367.432 25.634 TRUE incr
10.95865 393.066 34.79 TRUE incr
10.9587 427.856 32.349 TRUE incr
10.95875 460.205 20.142 TRUE incr
12.35745 477.295 -23.193 TRUE decr
The function applied to each row:
start = None
dir = None
sum_amp = 0
# note: the time comparison below assumes edges is indexed by time
for index, row in edges.iterrows():
    # this will collapse the multiple incr/decr together by taking only the first one seen
    # the others will get their edge set to False
    # it also assumes that the distance between multiple incr/decr is less than some threshold
    if start is None:
        start = index
        dir = row.edge_dir
        sum_amp = row.amp_change
    else:
        if row.edge_dir == dir and abs(start - index) < 0.01:
            edges.loc[index, 'edge'] = False
            sum_amp += row.amp_change  # sum amp increase so we can get an overall for this edge
        else:
            edges.loc[start, 'amp_change'] = sum_amp
            sum_amp = row.amp_change
            start = index
            dir = row.edge_dir
# write the sum for the final group, which the loop body never flushes
if start is not None:
    edges.loc[start, 'amp_change'] = sum_amp
This should produce:
time raw_signal amp_change edge edge_dir
2.73105 499.878 -93.384 TRUE decr
2.7311 477.295 -24.414 FALSE decr
2.73115 452.881 -25.025 FALSE decr
2.7312 427.856 -21.362 FALSE decr
2.7315 412.598 53.1 TRUE incr
2.73155 440.674 25.024 FALSE incr
8.5267 490.112 -80.566 TRUE decr
8.52675 465.698 -30.517 FALSE decr
8.5268 435.181 -25.635 FALSE decr
8.70805 413.208 45.776 TRUE incr
8.7081 434.57 24.414 FALSE incr
10.7113 487.671 -117.798 TRUE decr
10.71135 466.919 -34.79 FALSE decr
10.7114 432.129 -37.842 FALSE decr
10.71145 394.287 -24.414 FALSE decr
10.9586 367.432 112.915 TRUE incr
10.95865 393.066 34.79 FALSE incr
10.9587 427.856 32.349 FALSE incr
10.95875 460.205 20.142 FALSE incr
12.35745 477.295 -23.193 TRUE decr
Answer 0 (score: 2)
How about this one-liner approach:
In [16]:
df['New_amp_change'] = np.hstack((np.diff(~(np.sign(df.amp_change.shift(1))<0)), True))
In [40]:
df.loc[df.New_amp_change,'amp_change'] = df.groupby(df.New_amp_change.cumsum()).amp_change.sum().values
In [42]:
print(df)
time raw_signal amp_change edge edge_dir New_amp_change
0 2.73105 499.878 -93.384 True decr True
1 2.73110 477.295 -24.414 True decr False
2 2.73115 452.881 -25.025 True decr False
3 2.73120 427.856 -21.362 True decr False
4 2.73150 412.598 53.100 True incr True
5 2.73155 440.674 25.024 True incr False
6 8.52670 490.112 -80.566 True decr True
7 8.52675 465.698 -30.517 True decr False
8 8.52680 435.181 -25.635 True decr False
9 8.70805 413.208 45.776 True incr True
10 8.70810 434.570 24.414 True incr False
11 10.71130 487.671 -117.798 True decr True
12 10.71135 466.919 -34.790 True decr False
13 10.71140 432.129 -37.842 True decr False
14 10.71145 394.287 -24.414 True decr False
15 10.95860 367.432 112.915 True incr True
16 10.95865 393.066 34.790 True incr False
17 10.95870 427.856 32.349 True incr False
18 10.95875 460.205 20.142 True incr False
19 12.35745 477.295 -23.193 True decr True
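1. Shift amp_change by one position (shift(1)).
2. Check the sign, returning True for negative values.
3. Check whether the sign has changed (np.diff()).
4. Pad a True at the end (np.diff() returns a vector that is one element shorter).
5. groupby on the cumulative sum of the newly created New_amp_change column to get the group sums.
6. Assign the group sums back to the sign-change rows (the edges?) in the original DataFrame.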
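The steps above can also be sketched as a self-contained script that keys on edge_dir and the time gap instead of the sign of amp_change, and that also sets edge to False on the collapsed rows, as the question asks. This is only a sketch under assumptions, not the answer's code: the frame is assumed to be indexed by time (as the question's loop implies), MAX_GAP, new_run and run_id are names made up for the illustration, and the gap is measured between consecutive rows rather than from the first row of each run, which may differ from the loop on other data. It also avoids what appears to be a dependency in the one-liner on the first amp_change being negative and on the last row always starting a new run.

```python
import pandas as pd

# Hypothetical sample: the first eight rows of the question's data,
# indexed by time (an assumption; the question's loop compares index values).
edges = pd.DataFrame(
    {
        "time": [2.73105, 2.7311, 2.73115, 2.7312, 2.7315, 2.73155, 8.5267, 8.52675],
        "amp_change": [-22.583, -24.414, -25.025, -21.362, 28.076, 25.024, -24.414, -30.517],
        "edge": True,
        "edge_dir": ["decr"] * 4 + ["incr"] * 2 + ["decr"] * 2,
    }
).set_index("time")

MAX_GAP = 0.01  # same threshold as the loop in the question

t = edges.index.to_series()

# A new run starts when the direction changes or the gap to the previous row
# reaches the threshold. The first row always starts a run because comparing
# against the shifted-in NaN yields True.
# Simplification: the question's loop measures the gap from the run's first row.
new_run = edges["edge_dir"].ne(edges["edge_dir"].shift()) | (t.diff().abs() >= MAX_GAP)

# Consecutive rows of the same run share one integer label.
run_id = new_run.cumsum()

# Sum amp_change per run and write the total onto the first row of each run.
run_sum = edges.groupby(run_id)["amp_change"].sum()
edges.loc[new_run, "amp_change"] = run_sum.to_numpy()

# Only the first row of each run keeps edge == True.
edges.loc[~new_run, "edge"] = False

print(edges)
```

On these rows the first two run totals come out at roughly -93.384 and 53.1, in line with the expected output in the question; the third run is truncated in this sample, so its total differs from the full data.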