I'm trying to modify a set of columns for a subset of rows, and of course I get the warning:
A value is trying to be set on a copy of a slice from a DataFrame
I've seen a similar question here, but couldn't work out how to get around it.
So, taking this sample code:
import pandas as pd
from random import random as rd
ex= pd.DataFrame([{"group": ["a","b"][int(round(rd()))], "colA": rd()*10, "colB": rd()*10, "colC": rd()*10, "colD": rd()*10} for _ in range(20)])
cols = [col for col in ex.columns if col != "group"]
I want to modify only the rows belonging to group a, and only the columns in cols. Intuitively I would try this (and get the warning):
ex[ex["group"]=="a"][cols] = ex[ex["group"]=="a"][cols]/ex.ix[0,cols]
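The columns match in number and have the same labels, so I wonder whether I have to go through them one by one:
# Note: .ix was removed in pandas 1.0; on current versions use .loc here.
for idx in ex[ex["group"]=="a"].index:
    for col in cols:
        ex.ix[idx, col]=ex.ix[idx, col]/ex.ix[0,col]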
This works, of course, but it feels like a step backwards. So what is the right way to do something like this?
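For context, the warning appears because ex[ex["group"]=="a"] builds a new object from the boolean mask, so the second [cols] assignment lands on that temporary object rather than on ex. Below is a minimal sketch of that behaviour on a small hand-written frame; the frame df and its values are hypothetical, and the warning is silenced only to keep the demo quiet.

```python
import warnings

import pandas as pd

# Hypothetical toy frame, not the data from the question.
df = pd.DataFrame({"group": ["a", "b", "a"], "colA": [2.0, 4.0, 6.0]})
before = df.copy()

with warnings.catch_warnings():
    # Silence the chained-assignment warning for this demonstration only.
    warnings.simplefilter("ignore")
    # Chained indexing: the mask selection returns a copy, and the
    # assignment modifies that copy, which is then thrown away.
    df[df["group"] == "a"]["colA"] = 0.0

# The original frame is untouched.
unchanged = df.equals(before)
print(unchanged)
```

The flag unchanged comes out true because the write never reaches df, which is exactly what the warning is trying to say.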
Answer 0 (score: 1)
IIUC, you can do this in one step using .loc with your boolean condition, passing the list of cols:
In [110]:
from random import random as rd
ex= pd.DataFrame([{"group": ["a","b"][int(round(rd()))], "colA": rd()*10, "colB": rd()*10, "colC": rd()*10, "colD": rd()*10} for _ in range(20)])
cols = [col for col in ex.columns if col != "group"]
ex
Out[110]:
colA colB colC colD group
0 5.895114 3.961007 0.589091 9.846131 a
1 1.789049 7.532745 2.767378 9.144689 b
2 1.218778 2.715299 3.626688 6.516540 a
3 9.327049 3.207037 4.513850 1.910565 b
4 1.822876 0.049689 0.794706 8.463579 a
5 1.451741 6.045066 6.575130 4.882635 b
6 6.741825 4.253489 2.162466 1.050275 a
7 5.186613 3.401384 1.055468 4.060071 a
8 0.921352 8.076272 6.727293 3.219364 a
9 3.209232 8.883085 9.696195 4.089006 b
10 0.970030 6.412611 5.377420 5.475744 b
11 7.905807 4.576925 6.991989 2.974597 b
12 4.907642 7.123328 9.851058 2.337944 b
13 1.191606 2.636071 5.740342 3.301008 b
14 1.454777 3.086801 3.573110 1.402692 b
15 3.253882 1.853393 5.156287 8.268881 b
16 4.779060 4.689739 1.228976 6.339238 b
17 7.950160 4.973974 4.304821 4.492152 b
18 0.581628 6.860053 2.974577 6.542594 a
19 6.872025 9.216597 0.936447 5.518941 b
In [111]:
ex.loc[ex['group']=='a', cols] /= ex.iloc[0][cols]
ex
Out[111]:
colA colB colC colD group
0 1.000000 1.000000 1.000000 1.000000 a
1 1.789049 7.532745 2.767378 9.144689 b
2 0.206744 0.685507 6.156417 0.661838 a
3 9.327049 3.207037 4.513850 1.910565 b
4 0.309218 0.012545 1.349039 0.859584 a
5 1.451741 6.045066 6.575130 4.882635 b
6 1.143629 1.073840 3.670853 0.106669 a
7 0.879816 0.858717 1.791690 0.412352 a
8 0.156291 2.038944 11.419789 0.326967 a
9 3.209232 8.883085 9.696195 4.089006 b
10 0.970030 6.412611 5.377420 5.475744 b
11 7.905807 4.576925 6.991989 2.974597 b
12 4.907642 7.123328 9.851058 2.337944 b
13 1.191606 2.636071 5.740342 3.301008 b
14 1.454777 3.086801 3.573110 1.402692 b
15 3.253882 1.853393 5.156287 8.268881 b
16 4.779060 4.689739 1.228976 6.339238 b
17 7.950160 4.973974 4.304821 4.492152 b
18 0.098663 1.731896 5.049437 0.664484 a
19 6.872025 9.216597 0.936447 5.518941 b
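Because the session above uses random data, the result is hard to verify by eye. Here is a minimal sketch of the same .loc pattern on a tiny hand-written frame; the names df, mask and first_row and all the values are assumptions made for illustration, not part of the original example.

```python
import pandas as pd

# Hypothetical toy data chosen so the quotients are easy to check.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "colA": [2.0, 4.0, 6.0, 8.0],
    "colB": [10.0, 20.0, 30.0, 40.0],
})
cols = [c for c in df.columns if c != "group"]

# Boolean mask selecting the rows of group "a".
mask = df["group"] == "a"

# Take the numeric columns first, then row 0, so the divisor is a
# float Series labelled by column name.
first_row = df[cols].iloc[0]

# A single .loc call addresses rows and columns together, so the
# assignment writes into df itself. The division aligns first_row's
# labels with the column names.
df.loc[mask, cols] = df.loc[mask, cols] / first_row

print(df)
```

Rows in group a are divided column by column by the values of row 0, and rows in group b are left alone.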
Timings
In [112]:
%%timeit
for idx in ex[ex["group"]=="a"].index:
    for col in cols:
        ex.ix[idx, col]=ex.ix[idx, col]/ex.ix[0,col]
100 loops, best of 3: 11 ms per loop
In [113]:
%timeit ex.loc[ex['group']=='a', cols] /= ex.iloc[0][cols]
100 loops, best of 3: 5.3 ms per loop
So on your small sample my method is about 2x faster, and I would expect it to scale better on larger datasets because it is vectorized.
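As a sanity check, the sketch below compares the element-wise loop with the vectorized .loc assignment on a larger frame. The frame is generated with a fixed, arbitrarily chosen seed and the loop is written with .loc instead of .ix, so both are assumptions rather than the original code. The divisor row is saved up front: if row 0 belongs to group a, a loop that reads row 0 on every iteration would overwrite it with ones before the other rows are divided.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 200

# Hypothetical data shaped like the example: four numeric columns
# plus a group label. The +1 keeps the divisor away from zero.
base = pd.DataFrame(
    rng.random((n, 4)) * 10 + 1,
    columns=["colA", "colB", "colC", "colD"],
)
base["group"] = rng.choice(["a", "b"], size=n)
cols = [c for c in base.columns if c != "group"]

# Save row 0 of the numeric columns before anything is modified.
divisor = base[cols].iloc[0]

# Element-wise loop, one cell at a time.
looped = base.copy()
for idx in looped.index[looped["group"] == "a"]:
    for col in cols:
        looped.loc[idx, col] = looped.loc[idx, col] / divisor[col]

# Vectorized version: one masked assignment for all rows and columns.
vectorized = base.copy()
mask = vectorized["group"] == "a"
vectorized.loc[mask, cols] = vectorized.loc[mask, cols] / divisor

same = np.allclose(looped[cols].to_numpy(), vectorized[cols].to_numpy())
print(same)
```

Wrapping each half in %timeit at different values of n is one way to see how the gap between the two approaches changes with size.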