Sample subset of df:
Category Weight Test
1/21/2017 SuperMarket 0.02 Nan
1/21/2017 SuperMarket 0.18 Nan
1/21/2017 SuperMarket 0.71 Nan
1/21/2017 Hotel 0.53 Nan
1/21/2017 Hotel 0.93 0.93
1/21/2017 Hotel 0.97 Nan
1/21/2017 Bar 0.13 Nan
1/21/2017 Bar 0.31 Nan
1/21/2017 Bar 0.96 Nan
1/21/2017 Bar 0.65 0.65
1/21/2017 Bar 0.27 0.27
1/21/2017 Bar 0.24 Nan
1/21/2017 Hospital 0.65 0.65
1/21/2017 Hospital 0.90 0.90
1/21/2017 Hospital 1.00 1.00
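The new column df['Adjusted_weight'] is assigned values according to 3 conditions, applied per date and category:
1. If df['Test'] contains only NaNs, then df['Adjusted_weight'] = df['weight'].
2. If df['Test'] contains only values (no NaNs), then df['Adjusted_weight'] = df['weight'].
3. If df['Test'] contains both values and NaNs, then:
   i) where df['Test'] is NaN, df['Adjusted_weight'] = df['weight'] * 0.5
   ii) where df['Test'] has a value, df['Adjusted_weight'] = df['weight'] + SUM(df['weight'] - df['Adjusted_weight'] over the NaN rows) / (number of non-NaN values)
In part ii) we raise the adjusted weights of the rows that have a value, so that the sum of the adjusted weights in case 3 equals the sum of the original weights (for the given date and category).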
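The three rules can be sketched for a single (date, category) group in plain Python. This is only an illustration of the rules as stated above, not code from the question; the function name `adjust_group` and its list-based inputs are assumptions made for the example.

```python
import math


def adjust_group(weights, tests):
    """Apply the three rules to one (date, category) group.

    weights: list of floats.
    tests:   list of floats, where NaN marks a missing Test value.
    """
    missing = [math.isnan(t) for t in tests]
    n_values = missing.count(False)
    # Rules 1 and 2: only NaNs or only values, so weights are unchanged.
    if n_values == 0 or n_values == len(tests):
        return list(weights)
    # Rule 3 i): NaN rows keep half of their weight; the other half is pooled.
    pooled = sum(w * 0.5 for w, m in zip(weights, missing) if m)
    share = pooled / n_values
    # Rule 3 ii): rows with a value get their weight plus an equal share.
    return [w * 0.5 if m else w + share for w, m in zip(weights, missing)]


nan = float("nan")
hotel = adjust_group([0.53, 0.93, 0.97], [nan, 0.93, nan])
print([round(v, 3) for v in hotel])
```

With the Hotel rows as input, the result should match the 0.265, 1.68 and 0.485 shown in the example output.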
Example output:
Category Weight Test Adjusted Weight
1/21/2017 SuperMarket 0.02 Nan 0.02
1/21/2017 SuperMarket 0.18 Nan 0.18
1/21/2017 SuperMarket 0.71 Nan 0.71
1/21/2017 Hotel 0.53 Nan 0.265
1/21/2017 Hotel 0.93 0.93 1.68
1/21/2017 Hotel 0.97 Nan 0.485
1/21/2017 Bar 0.13 Nan 0.07
1/21/2017 Bar 0.31 Nan 0.16
1/21/2017 Bar 0.96 Nan 0.48
1/21/2017 Bar 0.65 0.65 1.06
1/21/2017 Bar 0.27 0.27 0.68
1/21/2017 Bar 0.24 Nan 0.12
1/21/2017 Hospital 0.65 0.65 0.65
1/21/2017 Hospital 0.90 0.90 0.90
1/21/2017 Hospital 1.00 1.00 1.00
Take Hotel on 1/21/2017 as an example: it has 2 NaNs and 1 value. For the 2 NaNs, the adjusted weight is simply df['weight'] * 0.5. The single value then becomes 0.93 + (0.53 - 0.265) + (0.97 - 0.485) = 1.68.
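Added for clarification:
For the Bar example, there are 4 NaN rows, for which df['Adjusted weight'] = 0.5 * df['weight']. 1/21/2017 Bar also has two values. Both need weight added to df['adjusted_weight'] so that the total equals the sum of df['weight'] for 1/21/2017 Bar. Using the unrounded halves (the example table rounds 0.065 and 0.155 to 0.07 and 0.16), the amount to distribute is (0.13 - 0.065) + (0.31 - 0.155) + (0.96 - 0.48) + (0.24 - 0.12) = 0.82. Since there are two values to distribute it over, 0.41 is added to each of 0.65 and 0.27, giving 1.06 and 0.68.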
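The Bar arithmetic can be checked with a short script. It suggests that the 0.82 total comes from the unrounded halves (0.065 and 0.155), while the two-decimal halves printed in the example table add up to 0.81. The variable names are chosen only for this illustration.

```python
# Bar on 1/21/2017: weights of the four NaN rows and of the two rows with a value.
nan_weights = [0.13, 0.31, 0.96, 0.24]
value_weights = [0.65, 0.27]

# Amount taken from the NaN rows (half of each weight), using unrounded halves.
removed = sum(w - w * 0.5 for w in nan_weights)
share = removed / len(value_weights)
adjusted_values = [round(w + share, 2) for w in value_weights]

# The same sum using the two-decimal halves shown in the example table.
table_halves = [0.07, 0.16, 0.48, 0.12]
removed_rounded = sum(w - h for w, h in zip(nan_weights, table_halves))

print(round(removed, 2), round(share, 2), adjusted_values)
print(round(removed_rounded, 2))
```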
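A group may contain any number of NaNs and values, or only NaNs, or only values.
The basic goal is to scale up the weights of the rows that have a value, within each date and category, while making sure the total weight in each (date, category) bucket stays the same as before.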
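One way to verify that goal is to compare the per-group sums of the two columns. The sketch below uses a few rows from the example output and assumes, purely for illustration, that the date lives in an ordinary column named "Date" rather than in the index.

```python
import numpy as np
import pandas as pd

# Hypothetical frame built from the Hotel and Hospital rows of the example output.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-21"] * 6),
    "Category": ["Hotel"] * 3 + ["Hospital"] * 3,
    "Weight": [0.53, 0.93, 0.97, 0.65, 0.90, 1.00],
    "Adjusted Weight": [0.265, 1.68, 0.485, 0.65, 0.90, 1.00],
})

# Sum both columns per (date, category) and compare with a float tolerance.
totals = df.groupby(["Date", "Category"])[["Weight", "Adjusted Weight"]].sum()
preserved = np.isclose(totals["Weight"], totals["Adjusted Weight"])
print(totals.assign(preserved=preserved))
```

A tolerance-based comparison is used because the adjusted weights are floating-point sums and may differ from the originals in the last few bits.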
I have many dates, and the real data is much larger than what is shown here. Thanks.
Answer 0 (score: 1)
You can define a function that does all the calculations and pass it to apply after grouping.
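import pandas as pd

def f(x):
    # x holds the rows of one (day, Category) group
    count = x.Test.count()  # number of non-NaN Test values
    size = x.Test.size      # number of rows in the group
    if count == 0 or count == size:
        # only NaNs or only values: weights stay unchanged
        return x.Weight
    else:
        # NaN rows keep half of their weight (0 on rows with a value)
        adj_null = x.Weight * x.Test.isnull() * .5
        notnull = x.Test.notnull()
        # the removed half (same size as the kept half) is split evenly
        distribute = adj_null.sum() / notnull.sum()
        adj_notnull = (x.Weight + distribute) * notnull
        return adj_null + adj_notnull

# pd.Grouper(freq='D') replaces pd.TimeGrouper('D'), which was removed in pandas 1.0.
# Assigning .values assumes the rows of each group are contiguous in df.
df['Adjusted Weight'] = df.groupby([pd.Grouper(freq='D'), 'Category'], sort=False).apply(f).values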
Category Weight Test Adjusted Weight
2017-01-21 SuperMarket 0.02 NaN 0.020
2017-01-21 SuperMarket 0.18 NaN 0.180
2017-01-21 SuperMarket 0.71 NaN 0.710
2017-01-21 Hotel 0.53 NaN 0.265
2017-01-21 Hotel 0.93 0.93 1.680
2017-01-21 Hotel 0.97 NaN 0.485
2017-01-21 Bar 0.13 NaN 0.065
2017-01-21 Bar 0.31 NaN 0.155
2017-01-21 Bar 0.96 NaN 0.480
2017-01-21 Bar 0.65 0.65 1.060
2017-01-21 Bar 0.27 0.27 0.680
2017-01-21 Bar 0.24 NaN 0.120
2017-01-21 Hospital 0.65 0.65 0.650
2017-01-21 Hospital 0.90 0.90 0.900
2017-01-21 Hospital 1.00 1.00 1.000
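As an alternative sketch (not the answerer's method), the same rules can be written with groupby(...).transform, which broadcasts per-group statistics back to the original rows and so does not depend on each group's rows being contiguous. It assumes the date is stored in an ordinary column named "Date", which is a choice made for this example only.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-21"] * 15),
    "Category": ["SuperMarket"] * 3 + ["Hotel"] * 3 + ["Bar"] * 6 + ["Hospital"] * 3,
    "Weight": [0.02, 0.18, 0.71, 0.53, 0.93, 0.97,
               0.13, 0.31, 0.96, 0.65, 0.27, 0.24, 0.65, 0.90, 1.00],
    "Test": [nan, nan, nan, nan, 0.93, nan,
             nan, nan, nan, 0.65, 0.27, nan, 0.65, 0.90, 1.00],
})

keys = [df["Date"], df["Category"]]
has_value = df["Test"].notna()

# Per-row group statistics: number of non-NaN Test values and group size.
n_values = has_value.astype(int).groupby(keys).transform("sum")
n_rows = has_value.groupby(keys).transform("size")
mixed = (n_values > 0) & (n_values < n_rows)

# Half of each NaN row's weight is removed, but only in mixed groups.
removed = (df["Weight"] * 0.5).where(mixed & ~has_value, 0.0)
pooled = removed.groupby(keys).transform("sum")
# clip(lower=1) avoids dividing by zero in all-NaN groups, where share is 0 anyway.
share = (pooled / n_values.clip(lower=1)).where(has_value, 0.0)

df["Adjusted Weight"] = df["Weight"] - removed + share
print(df.round(3))
```

On the sample data this should reproduce the Adjusted Weight column shown in the answer's output, with the per-group totals unchanged.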