New column based on conditions, using groupby on the index and one column

Date: 2017-02-07 21:37:34

Tags: python pandas dataframe group-by conditional-statements

Sample subset of df:

                Category    Weight  Test
1/21/2017       SuperMarket 0.02    NaN
1/21/2017       SuperMarket 0.18    NaN
1/21/2017       SuperMarket 0.71    NaN
1/21/2017       Hotel       0.53    NaN
1/21/2017       Hotel       0.93    0.93
1/21/2017       Hotel       0.97    NaN
1/21/2017       Bar         0.13    NaN
1/21/2017       Bar         0.31    NaN
1/21/2017       Bar         0.96    NaN
1/21/2017       Bar         0.65    0.65
1/21/2017       Bar         0.27    0.27
1/21/2017       Bar         0.24    NaN
1/21/2017       Hospital    0.65    0.65
1/21/2017       Hospital    0.90    0.90
1/21/2017       Hospital    1.00    1.00

A new column df['Adjusted_weight'] will be assigned values based on 3 conditions:

  1. If, for any date and category, df['Test'] contains only NaNs, then df['Adjusted_weight'] = df['Weight'].
  2. If, for any date and category, df['Test'] contains only values (no NaNs), then df['Adjusted_weight'] = df['Weight'].
  3. Finally, if for any date and category df['Test'] contains both values and NaNs, then:

     i) where df['Test'] is NaN: df['Adjusted_weight'] = df['Weight'] * 0.5

     ii) where df['Test'] has a value: df['Adjusted_weight'] = df['Weight'] + SUM(df['Weight'] - df['Adjusted_weight'], taken over the NaN rows) / (number of non-NaN values)

In part ii) we top up the adjusted weights where there are values, so that the sum of the adjusted weights (condition 3) equals the sum of the weights for that particular date and category.
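The three conditions above can be sketched for a single (date, category) group in plain Python. `adjust_group` is a hypothetical helper name, not part of the question's code:

```python
import math

def adjust_group(weights, tests):
    """Adjust weights for one (date, category) group.

    Illustrates the three conditions: all-NaN or all-valued groups keep
    their original weights; mixed groups halve the NaN-row weights and
    redistribute the removed mass equally across the non-NaN rows.
    """
    n_valid = sum(not math.isnan(t) for t in tests)
    if n_valid == 0 or n_valid == len(tests):
        return list(weights)                          # conditions 1 and 2
    # condition 3 i): halve the weight where Test is NaN
    adjusted = [w * 0.5 if math.isnan(t) else w for w, t in zip(weights, tests)]
    # condition 3 ii): mass removed from the NaN rows, split among non-NaN rows
    removed = sum(w - a for w, a in zip(weights, adjusted))
    share = removed / n_valid
    return [a if math.isnan(t) else a + share for a, t in zip(adjusted, tests)]

# Hotel group from the sample: two NaNs, one value
nan = float('nan')
print([round(v, 3) for v in adjust_group([0.53, 0.93, 0.97], [nan, 0.93, nan])])
# → [0.265, 1.68, 0.485]
```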

Example output:

                    Category    Weight  Test    Adjusted Weight
    1/21/2017       SuperMarket 0.02    NaN        0.02
    1/21/2017       SuperMarket 0.18    NaN        0.18
    1/21/2017       SuperMarket 0.71    NaN        0.71
    1/21/2017       Hotel       0.53    NaN        0.265
    1/21/2017       Hotel       0.93    0.93       1.68
    1/21/2017       Hotel       0.97    NaN        0.485
    1/21/2017       Bar         0.13    NaN        0.07
    1/21/2017       Bar         0.31    NaN        0.16
    1/21/2017       Bar         0.96    NaN        0.48
    1/21/2017       Bar         0.65    0.65       1.06
    1/21/2017       Bar         0.27    0.27       0.68
    1/21/2017       Bar         0.24    NaN        0.12
    1/21/2017       Hospital    0.65    0.65       0.65
    1/21/2017       Hospital    0.90    0.90       0.90
    1/21/2017       Hospital    1.00    1.00       1.00
    

I'll walk through the Hotel example for 1/21/2017, where there are 2 NaNs and 1 value. For the 2 NaNs, the adjusted weight is simply df['Weight'] * 0.5.

Now, for the single value: 0.93 + (0.53 - 0.265) + (0.97 - 0.485) = 1.68

i.e. the portions removed from the NaN rows are simply added back on.

For the Bar example, there are 4 NaN values, so df['Adjusted_weight'] = 0.5 * df['Weight'] for those rows. Now, 1/21/2017 Bar has two values. Both need weight added to df['Adjusted_weight'] so that the total equals the sum of df['Weight'] for 1/21/2017 Bar. The calculation is (0.13 - 0.065) + (0.31 - 0.155) + (0.96 - 0.48) + (0.24 - 0.12) = 0.82 (the table above shows the halved weights rounded to two decimals), and since there are two values to distribute over, 0.41 is added to 0.65 and to 0.27, giving 1.06 and 0.68.
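As a quick numeric check of the Bar walkthrough (plain Python, using exact halves rather than the rounded table values):

```python
# Bar group for 1/21/2017: Weight column and whether Test has a value
weights  = [0.13, 0.31, 0.96, 0.65, 0.27, 0.24]
has_test = [False, False, False, True, True, False]

# mass removed by halving the NaN-row weights
removed = sum(w * 0.5 for w, h in zip(weights, has_test) if not h)
share = removed / sum(has_test)          # split between the two valued rows

adjusted = [w * 0.5 if not h else w + share for w, h in zip(weights, has_test)]
print(round(removed, 2), round(share, 2))                 # 0.82 0.41
print(round(sum(adjusted), 2) == round(sum(weights), 2))  # group total preserved: True
```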

We may have any mix of NaNs and values, or only NaNs, or only values.

The basic goal is to scale up, within any date and category that has values, while making sure the total weight in that bucket (date, category) stays the same as before.

I have many dates, and the data is much larger than shown. Thanks.

1 Answer:

Answer 0 (score: 1):

You can define a function to pass to apply after grouping that does all the calculations.

def f(x):
    count = x.Test.count()   # number of non-NaN Test values in the group
    size = x.Test.size       # total rows in the group
    if count == 0 or count == size:
        return x.Weight      # conditions 1 and 2: weights unchanged
    else:
        # condition 3 i): halve the weight where Test is NaN (0 elsewhere)
        adj_null = x.Weight * x.Test.isnull() * 0.5
        notnull = x.Test.notnull()
        # condition 3 ii): the halved-off mass, shared across the non-NaN rows
        distribute = adj_null.sum() / notnull.sum()
        adj_notnull = (x.Weight + distribute) * notnull
        return adj_null + adj_notnull

# Note: pd.TimeGrouper was removed in pandas 1.0; use pd.Grouper(freq='D') there.
df['Adjusted Weight'] = df.groupby([pd.TimeGrouper('D'), 'Category'], sort=False).apply(f).values

               Category  Weight  Test  Adjusted Weight
2017-01-21  SuperMarket    0.02   NaN            0.020
2017-01-21  SuperMarket    0.18   NaN            0.180
2017-01-21  SuperMarket    0.71   NaN            0.710
2017-01-21        Hotel    0.53   NaN            0.265
2017-01-21        Hotel    0.93  0.93            1.680
2017-01-21        Hotel    0.97   NaN            0.485
2017-01-21          Bar    0.13   NaN            0.065
2017-01-21          Bar    0.31   NaN            0.155
2017-01-21          Bar    0.96   NaN            0.480
2017-01-21          Bar    0.65  0.65            1.060
2017-01-21          Bar    0.27  0.27            0.680
2017-01-21          Bar    0.24   NaN            0.120
2017-01-21     Hospital    0.65  0.65            0.650
2017-01-21     Hospital    0.90  0.90            0.900
2017-01-21     Hospital    1.00  1.00            1.000