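I can't find an elegant solution to this problem (there may not be one).
I have the following sample DataFrame:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10,10)).abs()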
0 1 2 3 4 5 6 \
0 1.764052 0.400157 0.978738 2.240893 1.867558 0.977278 0.950088
1 0.144044 1.454274 0.761038 0.121675 0.443863 0.333674 1.494079
2 2.552990 0.653619 0.864436 0.742165 2.269755 1.454366 0.045759
3 0.154947 0.378163 0.887786 1.980796 0.347912 0.156349 1.230291
4 1.048553 1.420018 1.706270 1.950775 0.509652 0.438074 1.252795
5 0.895467 0.386902 0.510805 1.180632 0.028182 0.428332 0.066517
6 0.672460 0.359553 0.813146 1.726283 0.177426 0.401781 1.630198
7 0.729091 0.128983 1.139401 1.234826 0.402342 0.684810 0.870797
8 1.165150 0.900826 0.465662 1.536244 1.488252 1.895889 1.178780
9 0.403177 1.222445 0.208275 0.976639 0.356366 0.706573 0.010500
7 8 9
0 0.151357 0.103219 0.410599
1 0.205158 0.313068 0.854096
2 0.187184 1.532779 1.469359
3 1.202380 0.387327 0.302303
4 0.777490 1.613898 0.212740
5 0.302472 0.634322 0.362741
6 0.462782 0.907298 0.051945
7 0.578850 0.311553 0.056165
8 0.179925 1.070753 1.054452
9 1.785870 0.126912 0.401989
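I have the following zone map:
zones = {"A": [0,1,2], "B": [3,4], "C": [5,6,7,8], "D": [9]}
The zones tell me which groups of columns to examine together. For each row of the df[columns] DataFrame, I keep the top N items (NB: the top N items within that row, i.e. cross-sectionally; see below) and set the rest to zero. For example, for zone "A" with N = 2, I would examine the following DataFrame: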
0 1 2
0 1.764052 0.400157 0.978738
1 0.144044 1.454274 0.761038
2 2.552990 0.653619 0.864436
3 0.154947 0.378163 0.887786
4 1.048553 1.420018 1.706270
5 0.895467 0.386902 0.510805
6 0.672460 0.359553 0.813146
7 0.729091 0.128983 1.139401
8 1.165150 0.900826 0.465662
9 0.403177 1.222445 0.208275
Since N = 2, I keep the top N items in each row:
0 1 2
0 1.764052 0. 0.978738
1 0. 1.454274 0.761038
2 2.552990 0. 0.864436
3 0. 0.378163 0.887786
4 0. 1.420018 1.706270
5 0.895467 0. 0.510805
6 0.672460 0. 0.813146
7 0.729091 0. 1.139401
8 1.165150 0.900826 0.
9 0.403177 1.222445 0.
The entire output for the zone map above with N = 2 would look like this:
0 1 2 3 4 5 6 \
0 1.764052 0. 0.978738 2.240893 1.867558 0.977278 0.950088
1 0. 1.454274 0.761038 0.121675 0.443863 0.333674 1.494079
2 2.552990 0. 0.864436 0.742165 2.269755 1.454366 0.
3 0. 0.378163 0.887786 1.980796 0.347912 0. 1.230291
4 0. 1.420018 1.706270 1.950775 0.509652 0. 1.252795
5 0.895467 0. 0.510805 1.180632 0.028182 0.428332 0.
6 0.672460 0. 0.813146 1.726283 0.177426 0. 1.630198
7 0.729091 0. 1.139401 1.234826 0.402342 0.684810 0.870797
8 1.165150 0.900826 0. 1.536244 1.488252 1.895889 1.178780
9 0.403177 1.222445 0. 0.976639 0.356366 0.706573 0.
7 8 9
0 0. 0. 0.410599
1 0. 0. 0.854096
2 0. 1.532779 1.469359
3 1.202380 0. 0.302303
4 0. 1.613898 0.212740
5 0. 0.634322 0.362741
6 0. 0.907298 0.051945
7 0. 0. 0.056165
8 0. 0. 1.054452
9 1.785870 0. 0.401989
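The way I tried to solve this feels a bit slow. I loop over the zones and get a zone_df for each one, then loop over its rows, sort each row, and call row.head(len(row) - N) to get the index and columns that need to be set to 0. I then use those values (stored in a dict) to set the cells in zone_df to zero, and finally combine the zone_dfs.
Answer 0 (score: 5)
Here's one way -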
def keeptopN_perkey(df, zones, N=2):
    a = df.values
    indx = zones.values()
    r = np.arange(a.shape[0])[:,None]
    for i in indx:
        b = a[:,i]
        L = np.maximum(len(i)-N,0)
        if L>0:
            idx = np.argpartition(b, L, axis=1)[:,:L]
            # or np.argsort(b,axis=1)[:,:L]
            b[r, idx] = 0
        a[:,i] = b
    return df
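The benefit is that, by working on the underlying array data, we write straight back into the input dataframe without creating a separate output dataframe.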
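Since the function above modifies its input, here is a minimal sketch of a variant that works on an explicit copy and leaves the original frame untouched. The name keep_top_n_copy is hypothetical, and it assumes the zone lists hold integer column positions, as in the question.

```python
import numpy as np
import pandas as pd

def keep_top_n_copy(df, zones, n=2):
    # Hypothetical helper: same argpartition idea, applied to a copy of the data.
    a = df.to_numpy(copy=True)
    rows = np.arange(a.shape[0])[:, None]
    for cols in zones.values():
        drop = len(cols) - n  # number of entries per row to zero out
        if drop > 0:
            block = a[:, cols]  # fancy indexing returns a copy of the zone
            # Positions of the `drop` smallest values in each row
            idx = np.argpartition(block, drop - 1, axis=1)[:, :drop]
            block[rows, idx] = 0
            a[:, cols] = block
    return pd.DataFrame(a, index=df.index, columns=df.columns)

df = pd.DataFrame([[55, 58, 75, 78, 78, 20, 94, 32, 47, 98]])
zones = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7, 8], "D": [9]}
out = keep_top_n_copy(df, zones, n=2)
```

The copy costs some memory, so this only makes sense when the caller still needs the unmodified frame afterwards.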
Sample run -
In [303]: np.random.seed(0)
...: N = 2
...: df = pd.DataFrame(np.random.randint(11,99,(4,10)))
...: zones = {"A": [0,1,2], "B": [3,4], "C": [5, 6,7,8], "D": [9]}
...:
In [304]: df
Out[304]:
0 1 2 3 4 5 6 7 8 9
0 55 58 75 78 78 20 94 32 47 98
1 81 23 69 76 50 98 57 92 48 36
2 88 83 20 31 91 80 90 58 75 93
3 60 40 30 30 25 50 43 76 20 68
In [305]: keeptopN_perkey(df, zones, N=2)
Out[305]:
0 1 2 3 4 5 6 7 8 9
0 0 58 75 78 78 0 94 0 47 98
1 81 0 69 76 50 98 0 92 0 36
2 88 83 0 31 91 80 90 0 0 93
3 60 40 0 30 25 50 0 76 0 68
Approaches from the other posts -
def mask_n(df, n): # @piRSquared's helper func
    v = np.zeros(df.shape, dtype=bool)
    n = min(n, v.shape[1])
    if v.shape[1] > n:
        j = np.argpartition(-df.values, n, 1)[:, :n].ravel()
        i = np.arange(v.shape[0]).repeat(n)
        v[i, j] = True
        return df.where(v, 0)
    else:
        return df
def piRSquared1(df, zones): # @piRSquared's soln1
    zinv = {v: k for k in zones for v in zones[k]}
    return df.groupby(zinv, 1).apply(mask_n, n=2)
def piRSquared2(df, zones): # @piRSquared's soln2
    zinv = {v: k for k in zones for v in zones[k]}
    return df.mask(df.groupby(zinv, 1).rank(axis=1, method='first',
                                           ascending=False) > 2, 0)
def COLDSPEED1(df, zones): # @COLDSPEED's soln
    for z in zones:
        df2 = df.iloc[:, zones[z]]
        df.iloc[:, zones[z]] = \
            np.where(((-df2).rank(axis=1) - 1) >= 2, 0, df2.values)
    return df
def s5s1(df, zones, N=2): # @s5s's soln
    final = []
    for zone_id, cols in zones.items():  # was .iteritems() (Python 2)
        values = {}
        d = df[cols] # zone A
        for i, row in d.iterrows():
            if len(row) > N:
                row = row.sort_values()  # was row.sort() (old pandas)
                row[row.head(len(row) - N).index] = 0
            values[i] = row
        d = pd.DataFrame(values).T
        final.append(d)
    return pd.concat(final, axis=1)[df.columns]
Timings on a bigger dataset -
In [458]: # Setup
...: ncols = 1000
...: cuts = np.sort(np.random.choice(ncols, ncols//3, replace=0))
...: indx_split = np.split(np.arange(ncols),cuts)
...: zones = {i:p_i for i,p_i in enumerate(list(map(list,indx_split)))}
...: df = pd.DataFrame(np.random.randint(11,99,(10,ncols)))
...: N = 2
...:
...: df1 = df.copy()
...: df2 = df.copy()
...: df3 = df.copy()
...: df4 = df.copy()
...: df5 = df.copy()
...:
In [459]: %timeit COLDSPEED1(df1, zones)
...: %timeit piRSquared1(df2, zones)
...: %timeit piRSquared2(df3, zones)
...: %timeit s5s1(df4, zones)
...: %timeit keeptopN_perkey(df5, zones)
...:
1 loop, best of 3: 324 ms per loop
10 loops, best of 3: 116 ms per loop
10 loops, best of 3: 81.6 ms per loop
1 loop, best of 3: 1.47 s per loop
100 loops, best of 3: 2.99 ms per loop
Answer 1 (score: 4)
Given the dataframe sub-slice:
df
0 1 2
0 1.764052 0.400157 0.978738
1 0.144044 1.454274 0.761038
2 2.552990 0.653619 0.864436
3 0.154947 0.378163 0.887786
4 1.048553 1.420018 1.706270
5 0.895467 0.386902 0.510805
6 0.672460 0.359553 0.813146
7 0.729091 0.128983 1.139401
8 1.165150 0.900826 0.465662
9 0.403177 1.222445 0.208275
Apply df.rank and set every value whose (zero-based) rank is >= N to 0:
v = df.values
v = np.where(((-df).rank(axis=1) - 1) >= 2, 0, v)
v
array([[ 1.764052, 0. , 0.978738],
[ 0. , 1.454274, 0.761038],
[ 2.55299 , 0. , 0.864436],
[ 0. , 0.378163, 0.887786],
[ 0. , 1.420018, 1.70627 ],
[ 0.895467, 0. , 0.510805],
[ 0.67246 , 0. , 0.813146],
[ 0.729091, 0. , 1.139401],
[ 1.16515 , 0.900826, 0. ],
[ 0.403177, 1.222445, 0. ]])
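One thing to keep in mind with the rank-based mask: df.rank defaults to method='average', so tied values share a rank and more than N of them can survive. The sample data has no ties, so it doesn't matter there. Here is a small sketch on a made-up row (not from the question) comparing the default with method='first':

```python
import numpy as np
import pandas as pd

# Made-up row with a three-way tie for the largest value
row = pd.DataFrame([[3.0, 1.0, 3.0, 3.0]])

# Default method='average': the three tied values all get zero-based rank 1
avg = (-row).rank(axis=1) - 1
# method='first': ties are ranked in order of appearance (0, 1, 2)
first = (-row).rank(axis=1, method="first") - 1

kept_avg = np.where(avg >= 2, 0, row.values)      # keeps all three 3.0 values
kept_first = np.where(first >= 2, 0, row.values)  # keeps exactly two values
```

If exactly N survivors per zone are required even with ties, passing method='first' looks like the safer choice.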
Generalizing to your dataframe, you have:
for z in zones:
    df2 = df.iloc[:, zones[z]]
    df.iloc[:, zones[z]] = \
        np.where(((-df2).rank(axis=1) - 1) >= 2, 0, df2.values)
df
0 1 2 3 4 5 6 \
0 1.76405 0 0.978738 2.24089 1.86756 0.977278 0.950088
1 0 1.45427 0.761038 0.121675 0.443863 0.333674 1.49408
2 2.55299 0 0.864436 0.742165 2.26975 1.45437 0
3 0 0.378163 0.887786 1.9808 0.347912 0 1.23029
4 0 1.42002 1.70627 1.95078 0.509652 0 1.2528
5 0.895467 0 0.510805 1.18063 0.0281822 0.428332 0
6 0.67246 0 0.813146 1.72628 0.177426 0 1.6302
7 0.729091 0 1.1394 1.23483 0.402342 0.68481 0.870797
8 1.16515 0.900826 0 1.53624 1.48825 1.89589 1.17878
9 0.403177 1.22245 0 0.976639 0.356366 0.706573 0
7 8 9
0 0 0 0.410599
1 0 0 0.854096
2 0 1.53278 1.46936
3 1.20238 0 0.302303
4 0 1.6139 0.21274
5 0 0.634322 0.362741
6 0 0.907298 0.0519454
7 0 0 0.0561653
8 0 0 1.05445
9 1.78587 0 0.401989
Answer 2 (score: 4)
Option 1
Using np.argpartition
zinv = {v: k for k in zones for v in zones[k]}
def mask_n(df, n):
    v = np.zeros(df.shape, dtype=bool)
    n = min(n, v.shape[1])
    if v.shape[1] > n:
        j = np.argpartition(-df.values, n, 1)[:, :n].ravel()
        i = np.arange(v.shape[0]).repeat(n)
        v[i, j] = True
        return df.where(v, 0)
    else:
        return df
df.groupby(zinv, 1).apply(mask_n, n=2)
Option 2
Using rank
zinv = {v: k for k in zones for v in zones[k]}
df.mask(df.groupby(zinv, 1).rank(axis=1, method='first', ascending=False) > 2, 0)
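As far as I know, grouping along axis=1 is deprecated in recent pandas releases, so the call above may warn or fail there. A sketch of the same rank-based idea that groups the transposed frame instead; top_n_by_zone is a hypothetical name, and the two rows are taken from the sample run in Answer 0:

```python
import pandas as pd

def top_n_by_zone(df, zones, n=2):
    # Map each column label to the name of its zone.
    zinv = {col: zone for zone, cols in zones.items() for col in cols}
    # Transposing puts the column labels on the index, so a plain groupby
    # ranks values within each zone, separately for every original row.
    ranks = df.T.groupby(zinv).rank(method="first", ascending=False).T
    # Zero out everything ranked below the top n of its zone.
    return df.mask(ranks > n, 0)

df = pd.DataFrame([[55, 58, 75, 78, 78, 20, 94, 32, 47, 98],
                   [81, 23, 69, 76, 50, 98, 57, 92, 48, 36]])
zones = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7, 8], "D": [9]}
out = top_n_by_zone(df, zones)
```

This returns a new frame rather than modifying df, and the two transposes add some overhead on very wide frames.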
Answer 3 (score: 0)
OK, this is the solution I originally wrote, so I'm adding it here as another version.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10,10)).abs()
N = 2
zones = {"A": [0,1,2], "B": [3,4], "C": [5,6,7,8], "D": [9]}
final = []
for zone_id, cols in zones.items():  # was .iteritems() (Python 2)
    values = {}
    d = df[cols] # zone A
    for i, row in d.iterrows():
        if len(row) > N:
            row = row.sort_values()  # was row.sort() (old pandas)
            row[row.head(len(row) - N).index] = 0
        values[i] = row
    d = pd.DataFrame(values).T
    final.append(d)
result = pd.concat(final, axis=1)[df.columns]
Test that the answer is the same:
expected = pd.DataFrame({0: [1.764052, 0., 0.978738, 2.240893, 1.867558, 0.977278, 0.950088, 0., 0., 0.410599],
1: [0., 1.454274, 0.761038, 0.121675, 0.443863, 0.333674, 1.494079, 0., 0., 0.854096],
2: [2.552990, 0., 0.864436, 0.742165, 2.269755, 1.454366, 0., 0., 1.532779, 1.469359],
3: [0., 0.378163, 0.887786, 1.980796, 0.347912, 0., 1.230291, 1.202380, 0., 0.302303],
4: [0., 1.420018, 1.706270, 1.950775, 0.509652, 0., 1.252795, 0., 1.613898, 0.212740],
5: [0.895467, 0., 0.510805, 1.180632, 0.028182, 0.428332, 0., 0., 0.634322, 0.362741],
6: [0.672460, 0., 0.813146, 1.726283, 0.177426, 0., 1.630198, 0., 0.907298, 0.051945],
7: [0.729091, 0., 1.139401, 1.234826, 0.402342, 0.684810, 0.870797, 0., 0., 0.056165],
8: [1.165150, 0.900826, 0., 1.536244, 1.488252, 1.895889, 1.178780, 0., 0., 1.054452],
9: [0.403177, 1.222445, 0., 0.976639, 0.356366, 0.706573, 0., 1.785870, 0., 0.401989],
}).T
assert (expected - result).abs().sum().sum() < 0.001
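A possible caveat with the summed-difference check: if the two frames ever end up with different labels, the subtraction yields NaN, which sum() skips, so the assertion passes silently. A sketch with made-up frames (not the ones above) using pd.testing.assert_frame_equal, which also compares shape and labels:

```python
import pandas as pd

# Made-up frames standing in for `expected` and `result`
expected = pd.DataFrame([[1.0, 0.0], [0.0, 2.0]])
result = pd.DataFrame([[1.0000001, 0.0], [0.0, 2.0]])

# Element-wise comparison within a tolerance; raises AssertionError on any
# mismatch in values, shape, or labels.
pd.testing.assert_frame_equal(result, expected, check_exact=False, atol=1e-3)

# With mismatched column labels the difference is all NaN, and the sum is 0.0,
# so a check like `... < 0.001` would pass without comparing anything.
shifted = result.rename(columns={0: 10, 1: 11})
silent = (expected - shifted).abs().sum().sum()
```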