Replace cell values in a df based on a complex condition

Time: 2019-09-12 13:13:53

Tags: pandas

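Hello friends,

  • I want to iterate over all numeric columns in the df (in a generic way).
  • For each unique df["Type"] group within each numeric column:

    replace every value greater than that column's mean + 2 standard deviations with "nan"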


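import pandas as pd

# d was not shown in the post; assumed here from the sample df below
d = {'PRODUCT': ['A', 'B', 'C', 'A', 'B', 'C']}
df = pd.DataFrame(data=d)
df['Test1'] = [7, 1, 2, 5, 1, 90]
df['Test2'] = [99, 10, 13, 12, 11, 87]
df['Type'] = ['Y', 'X', 'X', 'Y', 'Y', 'X']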

Sample df:

PRODUCT Test1   Test2   Type
A       7       99      Y
B       1       10      X
C       2       13      X
A       5       12      Y
B       1       11      Y
C       90      87      X

Expected output:

PRODUCT Test1   Test2   Type
A       7       nan     Y
B       1       10      X
C       2       13      X
A       5       12      Y
B       1       11      Y
C       nan     nan     X

2 Answers:

Answer 0: (score: 2)

Logically, it could look like this:

import numpy as np

test_cols = ['Test1', 'Test2']

# calculate mean and std with groupby
groups = df.groupby('Type')
test_mean = groups[test_cols].transform('mean')
test_std = groups[test_cols].transform('std')

# threshold
thresh = test_mean + 2 * test_std

# thresholding
df[test_cols] = np.where(df[test_cols]>thresh, np.nan, df[test_cols])

However, for your sample dataset, thresh is:

        Test1       Test2
0   10.443434  141.707912
1  133.195890  123.898159
2  133.195890  123.898159
3   10.443434  141.707912
4   10.443434  141.707912
5  133.195890  123.898159

No value in the sample exceeds its threshold, so nothing would change.
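As a quick check of the point above, here is a minimal sketch that computes each value's z-score within its Type group. It also compares against the known upper bound (n - 1) / sqrt(n) for a z-score based on the sample standard deviation in a group of n values; with n = 3 that bound is about 1.155, so a cutoff of 2 standard deviations cannot be exceeded in groups this small. The names z, max_z, bound and n_flagged are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'PRODUCT': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Test1': [7, 1, 2, 5, 1, 90],
    'Test2': [99, 10, 13, 12, 11, 87],
    'Type': ['Y', 'X', 'X', 'Y', 'Y', 'X'],
})
test_cols = ['Test1', 'Test2']

groups = df.groupby('Type')[test_cols]
# z-score of each value within its Type group (pandas std uses ddof=1)
z = (df[test_cols] - groups.transform('mean')) / groups.transform('std')

# Largest z-score observed anywhere in the sample
max_z = z.max().max()

# Upper bound on a z-score in a group of n values: (n - 1) / sqrt(n)
n = 3  # every Type group in the sample has 3 rows
bound = (n - 1) / np.sqrt(n)

# Number of cells that would be replaced with a cutoff of 2 std
n_flagged = int((z > 2).sum().sum())

print(max_z, bound, n_flagged)
```

With larger groups, or a smaller multiplier, the same thresholding would start to flag values.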

Answer 1: (score: 1)

You can use groupby with transform and do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame()
df['Product'] = ['A', 'B', 'C', 'A', 'B', 'C']
df['Test1']=[7,1,2,5,1,90]
df['Test2']=[99,10,13,12,11,87]
df['Type']=['Y','X','X','Y','Y','X']
df = df.set_index('Product')

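def nan_out_values(type_df):
    # mask() returns a copy with values above mean + 2*std set to NaN,
    # rather than modifying the group in place
    return type_df.mask(type_df > type_df.mean() + 2 * type_df.std())

df[['Test1', 'Test2']] = df.groupby('Type')[['Test1', 'Test2']].transform(nan_out_values)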
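The question asked for a generic way to cover all numeric columns, while both answers name the test columns explicitly. A minimal sketch of one way to generalize, assuming the numeric columns can be picked with select_dtypes; the helper name nan_out_outliers and its n_std parameter are hypothetical, not part of either answer:

```python
import pandas as pd

def nan_out_outliers(df, group_col, n_std=2):
    """Return a copy of df in which, per group, numeric values above
    the group's mean + n_std * std are replaced with NaN."""
    # All numeric columns, excluding the grouping column if it is numeric
    num_cols = df.select_dtypes(include='number').columns.difference([group_col])
    grouped = df.groupby(group_col)[list(num_cols)]
    # Per-row threshold, aligned to the original index
    thresh = grouped.transform('mean') + n_std * grouped.transform('std')
    out = df.copy()
    out[num_cols] = df[num_cols].mask(df[num_cols] > thresh)
    return out

# Sample data from the question
df = pd.DataFrame({
    'PRODUCT': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Test1': [7, 1, 2, 5, 1, 90],
    'Test2': [99, 10, 13, 12, 11, 87],
    'Type': ['Y', 'X', 'X', 'Y', 'Y', 'X'],
})

# n_std=1 is used here only so that some values are actually flagged
result = nan_out_outliers(df, 'Type', n_std=1)
print(result)
```

On the sample data, a multiplier of 1 flags the same cells that are marked nan in the question's expected output; whether that was the asker's intent is not stated in the post.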