Generic inner product for a pandas.Series and the columns of a pandas.DataFrame

Asked: 2016-06-08 18:39:44

Tags: python pandas

I'm trying to build a function that computes the conditional Shannon entropy in a DataFrame. I give it the following arguments:

import random
import numpy as np
import pandas as pd

rows = 1000
columns = 3

data = pd.DataFrame([[random.randrange(0, 4, 1) for x in range(columns)]
                     for y in range(rows)], columns=['a', 'b', 'c'])
target = ['a', 'b']
conditional = ['c']

So in this example I'm computing both H(a|c) and H(b|c) at once. Here is the code:

""" Split the data into groups according to 'c', then
    compute the shannon entropy for each column within each group """

entropy = data.groupby(conditional)[target].apply(shannon)
print("Entropy type", type(entropy), "\n",entropy.head(), "\n")

""" Then compute a Series with the probability of each value of 'c' """
prob_condition = data.groupby(conditional)[target].apply(len)/len(data)
print("Prob type", type(prob_condition), "\n",prob_condition.head(), "\n")

""" Different ways to compute the mean entropy, weighted 
    by the probability of each occurrence in 'c' """
print(entropy.apply((lambda x: (x * prob_condition))))
print(entropy.apply(lambda x: prob_condition.dot(x)).head(),"\n")

This generates the output:

    Entropy type <class 'pandas.core.frame.DataFrame'> 
           a         b
c                    
0  1.992605  1.984517
1  1.987800  1.980181
2  1.979485  1.994622
3  1.990220  1.982847 

Prob type <class 'pandas.core.series.Series'> 
 c
0    0.251
1    0.248
2    0.264
3    0.237
dtype: float64 

Method 1: 
 a    1.987384
b    1.985713
dtype: float64 

Method 2: 
 a    1.987384
b    1.985713
dtype: float64 
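(The two methods agree because summing an elementwise product is exactly the inner product. A minimal standalone check, with made-up weights and entropies aligned on the same index:)

```python
import pandas as pd

# Hypothetical per-group weights and entropies, aligned on the same index
prob = pd.Series([0.25, 0.25, 0.25, 0.25], index=[0, 1, 2, 3])
ent = pd.Series([2.0, 1.0, 2.0, 1.0], index=[0, 1, 2, 3])

# Elementwise product then sum equals the dot product
print((prob * ent).sum())  # 1.5
print(prob.dot(ent))       # 1.5
```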

Now, if my target is only 'a', I run into trouble:

target = ['a']

The output is:

Entropy type <class 'pandas.core.series.Series'> 
 c
0    1.992605
1    1.987800
2    1.979485
3    1.990220
dtype: float64 

Prob type <class 'pandas.core.series.Series'> 
 c
0    0.251
1    0.248
2    0.264
3    0.237
dtype: float64 

Method 1: 
 c
0    1.992605
1    1.987800
2    1.979485
3    1.990220
dtype: float64 

Traceback (most recent call last):

  File "<ipython-input-100-d48372bac628>", line 1, in <module>
    runfile('..../snippet.py', wdir='....')

  File "..../anaconda3/lib/python3.5/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "..../anaconda3/lib/python3.5/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "..../snippet.py", line 21, in <module>
    print("Method 2: \n", entropy.apply(lambda x: prob_condition.dot(x)).head(),"\n")

  File "..../anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 2237, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)

  File "pandas/src/inference.pyx", line 1088, in pandas.lib.map_infer (pandas/lib.c:63043)

  File "..../snippet.py", line 21, in <lambda>
    print("Method 2: \n", entropy.apply(lambda x: prob_condition.dot(x)).head(),"\n")

  File "..../anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 1451, in dot
    if lvals.shape[0] != rvals.shape[0]:

IndexError: tuple index out of range

The first method doesn't give the right answer because, as I understand it, x * prob_condition computes the outer product of the two vectors, whereas I need the inner product. The .dot call, on the other hand, fails even though I'm feeding it two Series...
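(The failure can be reproduced in isolation. When entropy is a Series, Series.apply hands the lambda scalar elements, not Series, so prob_condition.dot(x) ends up dotting a Series with a float. A minimal sketch, with made-up values:)

```python
import pandas as pd

prob = pd.Series([0.5, 0.5])
s = pd.Series([1.0, 2.0])

# Series.apply passes each element (a scalar) to the function, never a Series
print(s.apply(lambda x: isinstance(x, pd.Series)).tolist())  # [False, False]

# so dotting with the scalar fails inside Series.dot's shape check
try:
    s.apply(lambda x: prob.dot(x))
    raised = False
except Exception as exc:
    raised = True
    print(type(exc).__name__)
```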

I'm looking for a way to compute the inner product between the Series prob_condition and each column of entropy that works whether entropy is a Series (one column) or a DataFrame (many columns).
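(One way to sketch such a shape-agnostic inner product, assuming the entropy values are always indexed by the values of 'c': promote a Series to a one-column DataFrame with to_frame() before dotting. The helper name inner_by_column is made up for illustration.)

```python
import pandas as pd

def inner_by_column(weights, values):
    """Inner product of `weights` with each column of `values`.

    `values` may be a Series (treated as a one-column frame) or a
    DataFrame; the result is one number per column.
    """
    if isinstance(values, pd.Series):
        values = values.to_frame()
    # DataFrame.apply with axis=0 passes each column as a Series
    return values.apply(lambda col: weights.dot(col))

prob = pd.Series([0.25, 0.25, 0.25, 0.25], index=range(4))
one_col = pd.Series([2.0, 1.0, 2.0, 1.0], index=range(4), name='a')
two_col = pd.DataFrame({'a': [2.0, 1.0, 2.0, 1.0], 'b': [1.0, 1.0, 1.0, 1.0]})

print(inner_by_column(prob, one_col))  # a -> 1.5
print(inner_by_column(prob, two_col))  # a -> 1.5, b -> 1.0
```

Equivalently, once the values are a DataFrame, weights.dot(values) alone already returns one number per column.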

PS: You might ask why I don't just compute H(a|c) = H(a,c) - H(c). The reason is that I want to time both approaches, and I haven't coded the "joint" entropy yet. Besides, then I wouldn't learn whatever you're about to teach me :)

**Edit:** I've added the whole shannon function so the code can be run:

def shannon(data, conditional=None, target=None):
    """ if no target is specified, try to guess it """
    target = [target] if isinstance(target, str) else target
    conditional = [conditional] if isinstance(conditional, str) else conditional

    if target is None and not isinstance(data, pd.Series):
        target = list(set(data.keys())) if conditional is None else [var for var in list(set(data.keys())) if var not in conditional]

    """ if there are conditions, split data in groups and apply independently """
    if conditional is not None:
        entropy = data.groupby(conditional)[target].apply(shannon)
        print("Entropy type", type(entropy), "\n", entropy.head())
        prob_condition = data.groupby(conditional)[target].apply(len) / len(data)
        print("Prob type", type(prob_condition), "\n", prob_condition.head())
        cond_entropy = entropy.apply(lambda x: (x * prob_condition))
        print(entropy.apply(lambda x: prob_condition.dot(x)).head())
        print(entropy.apply(lambda x: sum(x * prob_condition)).head())
        return cond_entropy if len(cond_entropy) > 1 else cond_entropy[0]

    """ if data is a series compute right away """
    if isinstance(data, pd.Series):
        prob = data.value_counts()
        prob = prob / prob.sum()
        entropy = - sum([(p * np.log(p) / np.log(2.0) if p > 0 else 0) for p in prob])
        return entropy

    """ if there are no conditions but several columns, evaluate each column independently """
    entropy = data[target].apply(shannon, axis=0)
    return entropy if len(entropy) > 1 else entropy[0]

1 Answer:

Answer 0 (score: 1)

OK, I figured it out. Following @BrenBarn's suggestion, I traced how DataFrames and Series were being used.

The problem I ran into when type(entropy)==Series (i.e. when there is only one column, target=['a']) was caused by the unexpected behavior of the apply call in the line entropy = data.groupby(conditional)[target].apply(shannon). When Groupby.apply is invoked on a single column it returns a Series, whereas the documentation suggests it always returns a DataFrame (it is not terribly explicit about this, by the way). That was the problem: the subsequent apply call was then feeding individual elements (rows of the single column) into the inner-product computation, which of course cannot work.

I replaced the apply call with a Groupby.aggregate call, which behaves the same way but returns a DataFrame regardless of the number of columns. I have to say I'm a little uneasy about the lack of documentation of the latter.
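(The return-type difference is easy to reproduce with a toy frame: apply on a one-column selection with a scalar-per-group function collapses to a Series, while aggregate keeps the DataFrame shape. A minimal sketch:)

```python
import pandas as pd

df = pd.DataFrame({'c': [0, 0, 1, 1], 'a': [1.0, 2.0, 3.0, 4.0]})
g = df.groupby('c')[['a']]

# apply with a scalar-per-group function collapses to a Series
r_apply = g.apply(lambda grp: grp['a'].mean())
print(type(r_apply).__name__)  # Series

# aggregate applies per column and keeps a DataFrame, even with one column
r_agg = g.aggregate(lambda col: col.mean())
print(type(r_agg).__name__)  # DataFrame
```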

For completeness, here is the whole function with that fix applied:

def shannon(data, conditional=None, target=None):
    """ if no target is specified, try to guess it """
    target = [target] if isinstance(target, str) else target
    conditional = [conditional] if isinstance(conditional, str) else conditional

    if target is None and not isinstance(data, pd.Series):
        target = list(set(data.keys())) if conditional is None else [var for var in list(set(data.keys())) if var not in conditional]

    """ if there are conditions, split data in groups and apply independently """
    if conditional is not None:
        entropy = data.groupby(conditional)[target].aggregate(shannon)
        prob_condition = data.groupby(conditional)[target].apply(len) / len(data)
        cond_entropy = entropy.apply(lambda x: prob_condition.dot(x))
        return cond_entropy if len(cond_entropy) > 1 else cond_entropy[0]

    """ if data is a series compute right away """
    if isinstance(data, pd.Series):
        prob = data.value_counts()
        prob = prob / prob.sum()
        return - sum([(p * np.log(p) / np.log(2.0) if p > 0 else 0) for p in prob])

    """ if there are no conditions but several columns, evaluate each column independently """
    entropy = data[target].apply(shannon, axis=0)
    return entropy if len(entropy) > 1 else entropy[0]