WOE and IV table in Python

Asked: 2018-05-15 14:00:05

Tags: python python-3.x pandas

I have a function that calculates WOE and IV, as follows:

def calc_iv(df, feature, target, pr=0):

    lst = []

    for i in range(df[feature].nunique()):
        val = list(df[feature].unique())[i]
        lst.append([feature, val, df[df[feature] == val].count()[feature], df[(df[feature] == val) & (df[target] == 1)].count()[feature]])

    data = pd.DataFrame(lst, columns=['Variable', 'Value', 'All', 'Bad'])
    data = data[data['Bad'] > 0]

    data['Share'] = data['All'] / data['All'].sum()
    data['Bad Rate'] = data['Bad'] / data['All']
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['grp_score'] = round((data['Distribution Good']/(data['Distribution Good'] + data['Distribution Bad']))*10, 2)
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])
    data['IV'] = (data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])).sum()
    data['Efficiency'] =  abs(data['Distribution Good'] - data['Distribution Bad'])/2  
    data = data.sort_values(by=['Variable', 'Value'], ascending=True)

    d = {data['Distribution Good'],data['Distribution Bad'],data['Share'],
         data['Bad Rate'],data['grp_score'],data['WoE'],data['IV'],data['Efficiency']}

    mydf=pd.DataFrame(data=d)

    if pr == 1:
        print(data)

    #return data['IV'].values[0]
    return mydf.values

The function is applied to a DataFrame (dat) that looks like this:

myvar1 myvar2  myvar3  myvar4  target
 0       50     1000    7800     1
10       87     500     10000    0
35       0      3000    20000    0

Then I call the function like this:

calc_iv(dat, 'myvar1', 'target', pr=0)

I expect the function to return the following for myvar1:

Distribution Good Distribution Bad Share Bad Rate grp_score WoE IV    Efficiency
 0.1                   0.9          1        0.9     20      0.2  0.6    0.8
 0.8                   0.2          2        0.2     10      0.1  0.2    0.1
 0.7                   0.3          3        0.3     70      0.7  0.8    0.5

But I get the following error:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

1 Answer:

Answer 0: (score: 0)

Well, this has been open for a while, but for anyone who runs into this problem: the error comes mainly from this code.

d = {data['Distribution Good'],data['Distribution Bad'],data['Share'],
         data['Bad Rate'],data['grp_score'],data['WoE'],data['IV'],data['Efficiency']}

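The exception is raised because curly braces around comma-separated values create a Python set, and set elements must be hashable. The class Series extends NDFrame, which deliberately does not allow hashing (as its source code shows).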
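To make the cause concrete, here is a minimal sketch that reproduces the error outside the function. The two Series and the column names `good` and `bad` are made up for illustration. The exact wording of the TypeError message may differ between pandas versions, so the sketch only checks that one is raised.

```python
import pandas as pd

# Two illustrative Series (values are made up).
s1 = pd.Series([0.1, 0.8, 0.7])
s2 = pd.Series([0.9, 0.2, 0.3])

# {s1, s2} is a set literal, so Python tries to hash each Series.
try:
    d = {s1, s2}
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# A dict with string keys works, because only the keys are hashed.
frame = pd.DataFrame({"good": s1, "bad": s2})
print(frame)
```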

The simplest fix is to select the columns like this:

d = data[[
      'Distribution Good',
      'Distribution Bad',
      'Share',
      'Bad Rate',
      'grp_score',
      'WoE',
      'IV',
      'Efficiency'
  ]]
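Putting the column selection back into the function could look roughly like the sketch below. It is one possible rewrite, not the OP's code: the name `calc_iv_table` and the sample data are hypothetical, the per-value loop is swapped for a `groupby`, and it returns a DataFrame rather than `.values`.

```python
import numpy as np
import pandas as pd


def calc_iv_table(df, feature, target):
    """Per-value WoE/IV table for one feature (hypothetical helper)."""
    # Count all rows and "bad" rows (target == 1) per distinct feature value.
    tmp = pd.DataFrame({"Value": df[feature],
                        "Bad": (df[target] == 1).astype(int)})
    data = tmp.groupby("Value")["Bad"].agg(["count", "sum"]).reset_index()
    data.columns = ["Value", "All", "Bad"]
    data = data[data["Bad"] > 0].copy()  # same filter as the original function

    total_all = data["All"].sum()
    total_bad = data["Bad"].sum()
    data["Share"] = data["All"] / total_all
    data["Bad Rate"] = data["Bad"] / data["All"]
    data["Distribution Good"] = (data["All"] - data["Bad"]) / (total_all - total_bad)
    data["Distribution Bad"] = data["Bad"] / total_bad
    good, bad = data["Distribution Good"], data["Distribution Bad"]
    data["grp_score"] = (good / (good + bad) * 10).round(2)
    data["WoE"] = np.log(good / bad)
    data["IV"] = (data["WoE"] * (good - bad)).sum()  # one total, repeated per row
    data["Efficiency"] = (good - bad).abs() / 2

    # Select columns by name instead of building a set of Series.
    columns = ["Distribution Good", "Distribution Bad", "Share", "Bad Rate",
               "grp_score", "WoE", "IV", "Efficiency"]
    return data[columns]


# Made-up sample: value "A" has 1 bad out of 4, value "B" has 3 bad out of 4.
sample = pd.DataFrame({
    "myvar": ["A"] * 4 + ["B"] * 4,
    "target": [1, 0, 0, 0, 1, 1, 1, 0],
})
result = calc_iv_table(sample, "myvar", "target")
print(result)
```

Call `result.values` if a plain NumPy array is needed, as in the original `return` statement.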

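On the other hand, if the OP wants results for all three rows, they may want to remove this "filter" line, which drops every value that has no bad cases: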

data = data[data['Bad'] > 0]
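One caveat when removing the filter: a value with no bad cases has a Distribution Bad of 0, so its WoE becomes infinite. The sketch below shows this with the counts from the question's three-row sample (one row per myvar1 value, only the first with target 1). The 0.5 smoothing constant is an assumption for illustration, not something from the question or the answer.

```python
import numpy as np
import pandas as pd

# Per-value counts for myvar1 in the question's sample data.
counts = pd.DataFrame({"Value": [0, 10, 35], "All": [1, 1, 1], "Bad": [1, 0, 0]})
good = counts["All"] - counts["Bad"]

dist_good = good / good.sum()
dist_bad = counts["Bad"] / counts["Bad"].sum()

# Without the filter, zero counts lead to log(0) or log(inf).
with np.errstate(divide="ignore"):
    woe = np.log(dist_good / dist_bad)
print(woe.tolist())

# Assumed workaround: add a small constant to every count before normalising.
eps = 0.5
smooth_good = (good + eps) / (good + eps).sum()
smooth_bad = (counts["Bad"] + eps) / (counts["Bad"] + eps).sum()
woe_smoothed = np.log(smooth_good / smooth_bad)
print(woe_smoothed.tolist())
```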