Question

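I have a pandas DataFrame that contains the probability of each sample belonging to each class (column). As it happens, almost 99% of the classes have a probability < 0.01, and very few have a probability > 0.5. For some reason, I want the probabilities to follow a Gaussian distribution between 0 and 1. In that case I suppose the mean should be 0.5, but if possible I would also like to be able to change the mean of this distribution. I want to do this for each row separately. How can I do this with a pandas DataFrame?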

Answer 1

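If you want to reproduce a more Gaussian-like distribution, you are really talking about weights on the individual points (the sequence of scores).
So I suggest using Gaussian-distributed weights to scale the scores.
Here is an example: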

import numpy as np
import pandas as pd
#Preparation of the data
nclasses = 10
nsamples = 5
df_c = []
for nc in range( nsamples ):
    a = np.random.rand(nclasses)
    a = [n/np.sum(a) for n in a]
    df_c.append( a )

df = pd.DataFrame(df_c)

# Now let's weight

for nr in range(len(df)):  # iterate over rows
    a = df.iloc[nr].to_numpy()  # capture the nth row as a NumPy array
    # generate Gaussian weights
    gw = np.random.normal(np.mean(a), np.std(a), len(a))
    # sort gw and a so that the largest weight is paired with the largest score
    gw = np.sort(gw)
    b_ind = np.argsort(a)  # indexes that sort a
    b = a[b_ind]           # sorted version of a
    # now weight the row (sorted scores times sorted weights)
    aw_r = b * gw  # you can reduce the magnitude by adding another factor, like 0.8 for instance
    # back from sort: put each weighted value at its original position
    aw = np.empty_like(aw_r)
    aw[b_ind] = aw_r
    # update the dataframe
    df.iloc[nr] = aw

# there you go!
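
As a possible alternative to the weighting above, here is a minimal sketch of a different technique: a rank-based mapping that replaces each value in a row with the normal quantile of its rank, then clips to [0, 1]. It is not the method used in this answer. The function name gaussianize_row, the default std of 0.15, and the toy data are assumptions made only for illustration.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd


def gaussianize_row(row, mean=0.5, std=0.15):
    """Hypothetical helper: map one row to normal quantiles by rank, clipped to [0, 1]."""
    n = len(row)
    # ranks 1..n (ties share an averaged rank) become quantiles strictly inside (0, 1)
    quantiles = (row.rank(method="average").to_numpy() - 0.5) / n
    dist = NormalDist(mu=mean, sigma=std)
    mapped = np.array([dist.inv_cdf(q) for q in quantiles])
    # clipping only matters when mean/std push some quantiles outside [0, 1]
    return pd.Series(np.clip(mapped, 0.0, 1.0), index=row.index)


# toy data (assumed): skewed rows where most values are close to 0
rng = np.random.default_rng(0)
raw = rng.random((5, 10)) ** 6
probs = pd.DataFrame(raw / raw.sum(axis=1, keepdims=True))

# apply the mapping to every row separately
gaussianized = probs.apply(gaussianize_row, axis=1)
```

The ordering of values within each row is preserved, and changing the mean argument moves the centre of the resulting distribution. Note that the transformed rows no longer sum to 1.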

Hope this helps.

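Update
If you want to adjust the mean of every row to the same value, for example 0.5, you only need to subtract the difference between the row mean and the target mean (0.5 in this case).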

a=np.array([1,2,3,47,2,6])
print( a.mean() ) # 10.1666
target_mean = 0.5

a_adj = a-(np.mean(a) - target_mean)
print( np.mean( a_adj ) ) # 0.5

This means that in the main example above, before assigning aw to df.iloc[nr], you should do

aw = aw-(np.mean(aw) - 0.5)
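
The same shift can probably be applied to all rows at once, without a loop. The sketch below assumes a DataFrame of random values named df_demo (a made-up name for illustration) and uses DataFrame.sub with axis=0 to subtract each row's own offset.

```python
import numpy as np
import pandas as pd

target_mean = 0.5

# toy data (assumed): 5 samples, 10 classes
df_demo = pd.DataFrame(np.random.default_rng(0).random((5, 10)))

# per-row offset = row mean - target mean; axis=0 aligns the offsets with the row index
row_offset = df_demo.mean(axis=1) - target_mean
shifted = df_demo.sub(row_offset, axis=0)

print(shifted.mean(axis=1))  # every row mean is now (numerically) 0.5
```

Keep in mind that a plain shift can push individual values below 0 or above 1, so a clip or rescale may still be needed if the results must stay in that range.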
