Averaging blocks of cells in a pandas DataFrame

Time: 2016-12-07 07:41:20

Tags: python pandas numpy dataframe average

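For reasons that are hard to explain, I would like to average blocks of cells in a pandas DataFrame that is sparsely populated with random values. The DataFrame will always contain sqrt(number of columns x number of indices) values; everything else is NaN. The values are roughly evenly distributed, so if I average blocks of cells of the right size, I expect to find one value in each block.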

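Here is my example. For 100 columns and 100 indices, I have 100 values randomly distributed across the DataFrame. I would expect ~1 value per 10x10 block, with all the other cells in that block being NaN. How can I turn each 10x10 block into a single cell (averaging the 10 columns, the 10 indices, and the values)?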

My code:

import pandas as pd
import numpy as np
import math

number_of_planes = 100

thicknesses = np.empty(number_of_planes)
cos_thetas = np.empty(number_of_planes)
phis = np.empty(number_of_planes)
for i in range(0,number_of_planes):
    r = 1
    phi = np.random.uniform(0,2*math.pi)
    theta = math.acos(2*np.random.uniform(0.5,1) - 1)
    thickness = np.random.uniform(0,0.4)

    phis[i] = phi
    cos_thetas[i] = math.cos(theta)
    thicknesses[i] = thickness

thick_df = pd.DataFrame(columns=phis, index=cos_thetas)

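# DataFrame.set_value is deprecated and was removed in pandas 1.0;
# .at is the label-based scalar setter that replaces it.
for i in range(0, len(thicknesses)):
    thick_df.at[cos_thetas[i], phis[i]] = thicknesses[i]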

thick_df = thick_df.sort_index(axis=0, ascending=False)
thick_df = thick_df.sort_index(axis=1)

2 Answers:

Answer 0: (score: 3)

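IIUC, you could reshape the data into a 4D array, splitting each axis into two axes of length sqrt(length of that axis), and then compute the mean along the second and fourth axes while ignoring NaNs with np.nanmean -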

arr = thick_df.values.astype(float)
n = int(np.sqrt(number_of_planes))

out = np.nanmean(arr.reshape(n,n,n,n),axis=(1,3))

indx = thick_df.index.values.reshape(-1,n).mean(1)
coln = thick_df.columns.values.reshape(-1,n).mean(1)
df_out = pd.DataFrame(out, index=indx, columns= coln)

Sample run -

In [174]: thick_df # number_of_planes = 4
Out[174]: 
          4.550477  5.138694  5.411510  6.123163
0.981987       NaN       NaN  0.393233       NaN
0.565861  0.186647       NaN       NaN       NaN
0.193190       NaN       NaN       NaN   0.11626
0.088382       NaN  0.166189       NaN       NaN

In [175]: df_out
Out[175]: 
          4.844586  5.767337
0.773924  0.186647  0.393233
0.140786  0.166189  0.116260
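
To make the axis bookkeeping in this answer easier to follow, here is a minimal standalone sketch of the reshape-then-nanmean step on a plain 4x4 NumPy array. The array is hard-coded from the sample run above, and the variable names `blocks` and `out` are only illustrative. It assumes both dimensions are exact multiples of the block edge `n`; reshape raises an error otherwise.

```python
import numpy as np

# 4x4 input with one value per 2x2 block (values copied from the sample run).
arr = np.array([
    [np.nan,   np.nan,   0.393233, np.nan],
    [0.186647, np.nan,   np.nan,   np.nan],
    [np.nan,   np.nan,   np.nan,   0.11626],
    [np.nan,   0.166189, np.nan,   np.nan],
])
n = 2  # block edge length (assumed to divide both dimensions exactly)

# Axes after the reshape:
# (block row, row within block, block column, column within block)
blocks = arr.reshape(arr.shape[0] // n, n, arr.shape[1] // n, n)

# Averaging over axes 1 and 3 collapses every n x n block to a single cell,
# and nanmean skips the NaN entries inside each block.
out = np.nanmean(blocks, axis=(1, 3))
print(out)
```

Each 2x2 block here holds exactly one value, so the result simply reproduces those four values in block order. A block with no values at all would come out as NaN, and np.nanmean emits a RuntimeWarning in that case.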

Answer 1: (score: 3)

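m, n = 10, 10  # block height and width
# Integer block label for every row position and every column position
row_groups = np.arange(len(thick_df.index)) // m
col_groups = np.arange(len(thick_df.columns)) // n

# Same values, relabelled on both axes by block id
grpd = pd.DataFrame(thick_df.values, row_groups, col_groups)

# Stack to long form, coerce to numeric, then average within each (row block, column block) pair
val = pd.to_numeric(grpd.stack(), errors='coerce').groupby(level=[0, 1]).mean().unstack().values
# Label each block with the mean of its original index / column values
idx = thick_df.index.to_series().groupby(row_groups).mean().values
col = thick_df.columns.to_series().groupby(col_groups).mean().values

pd.DataFrame(val, idx, col)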
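
As a hedged variation on the same grouping idea (not the answer's exact code), the sketch below builds an explicit long-form table of block ids instead of stacking a frame with repeated labels. The helper name `block_mean_groupby` and the `demo` frame are hypothetical, introduced only for illustration. One reason to consider this form is that integer division assigns leftover rows and columns to a smaller final block, so the shape does not have to be an exact multiple of the block size.

```python
import numpy as np
import pandas as pd

def block_mean_groupby(df, m, n):
    """Average m x n blocks of df, ignoring NaNs (hypothetical helper)."""
    values = df.to_numpy(dtype=float)
    # Positional row and column number of every cell
    rows, cols = np.indices(values.shape)
    long = pd.DataFrame({
        "row_block": (rows // m).ravel(),
        "col_block": (cols // n).ravel(),
        "value": values.ravel(),
    })
    # groupby().mean() skips NaN; a block with no values at all yields NaN
    return long.groupby(["row_block", "col_block"])["value"].mean().unstack()

# Made-up 3x5 frame: with 2x2 blocks, the last row and last column
# form smaller, partial blocks.
demo = pd.DataFrame([
    [np.nan, 1.0,    np.nan, np.nan, 5.0],
    [3.0,    np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, 2.0,    4.0,    np.nan],
])
result = block_mean_groupby(demo, m=2, n=2)
print(result)
```

The result is indexed by block number rather than by averaged labels; the idx and col computation from the answer can be applied on top if the averaged index and column values are needed.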
