Calculating the "concentration" of a pandas categorical

Time: 2016-02-02 14:52:16

Tags: python pandas

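I have a problem with an old function that computes the concentration of a categorical column in pandas. Something seems to have changed, so that the result of the .value_counts() method of a categorical Series can no longer be subset.

Minimal non-working example: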

import pandas as pd
import numpy as np

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df,cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.keys():
        single = np.square(np.divide(float(counts[key]),float(obs)))
        all_cons.append(single)
    # Return the total once every category has been processed
    return np.sum(all_cons)

get_concentration(df, "A")

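This raises a KeyError at counts["a"]. I am fairly sure this worked in past versions of pandas, and the documentation does not seem to mention any change to the .value_counts() method.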

3 answers:

Answer 0: (score: 2)

Let's agree on the methodology:

>>> df.A.value_counts()
a    2
b    1
c    1

>>> obs = len(df['A'].astype('category'))
>>> obs
4

The concentration should be as follows (per the Herfindahl index, the sum of squared shares):

>>> (2 / 4.) ** 2 + (1 / 4.) ** 2 + (1 / 4.) ** 2
0.375

Which is equivalent to (pandas 0.17+):

>>> ((df.A.value_counts() / df.A.count()) ** 2).sum()
0.375

If you really want a function:

def concentration(df, col):
    return ((df[col].value_counts() / df[col].count()) ** 2).sum()

>>> concentration(df, 'A')
0.375
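
As a variation on the function above, `value_counts` accepts `normalize=True`, which returns shares rather than raw counts, so the explicit division can be dropped. This is only a minimal sketch, assuming missing values should be ignored (the default `dropna=True`); the name `concentration_normalized` is illustrative and not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

def concentration_normalized(df, col):
    # normalize=True yields each value's share of the non-missing observations
    shares = df[col].value_counts(normalize=True)
    # Herfindahl index: sum of the squared shares
    return (shares ** 2).sum()

result = concentration_normalized(df, "A")
print(result)
```

For the example frame this should agree with the hand calculation above, since the shares are 0.5, 0.25 and 0.25.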

Answer 1: (score: 1)

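To fix your current function, you only need to look the values up by label, using .loc with the keys from the index (see below). You would be better off with a vectorized function, though; I have added one at the end.

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df, cat):
    tmp = df[cat].astype('category')
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.index:
        # .loc looks each count up by its label
        single = np.square(np.divide(float(counts.loc[key]), float(obs)))
        all_cons.append(single)
    # Sum only after the loop has visited every category
    return np.sum(all_cons)

This yields:

get_concentration(df, "A")

0.375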

You may want to try a vectorized version, which also does not necessarily need the category dtype, for example:

def get_concentration(df, cat):
    counts = df[cat].value_counts()
    # Divide by the number of observations, not the number of distinct values
    return counts.div(counts.sum()).pow(2).sum()
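
To illustrate the point about the category dtype, here is a quick sanity check. It is a sketch that uses the total number of observations as the denominator, with a hypothetical helper named `herfindahl`. A categorical column may carry unused categories, which `value_counts` reports with a count of zero; they add nothing to the sum, so both dtypes should give the same result:

```python
import pandas as pd

def herfindahl(series):
    # Counts per distinct value (unused categories show up with count 0)
    counts = series.value_counts()
    # Shares of the total number of counted observations, squared and summed
    return counts.div(counts.sum()).pow(2).sum()

plain = pd.Series(["a", "b", "c", "a"])
# Same data as a categorical, with an extra category "d" that never occurs
categorical = pd.Series(
    pd.Categorical(["a", "b", "c", "a"], categories=["a", "b", "c", "d"])
)

plain_result = herfindahl(plain)
categorical_result = herfindahl(categorical)
print(plain_result, categorical_result)
```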

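Answer 2: (score: 1)

Since you are iterating in a loop (rather than working in a vectorized fashion), you might as well iterate over the pairs explicitly. It simplifies the syntax, IMHO: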

import pandas as pd
import numpy as np

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df,cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    # See change in following line - you're anyway iterating 
    #    over key-value pairs; why not do so explicitly?
    for k, v in counts.to_dict().items():
        single = np.square(np.divide(float(v),float(obs)))
        all_cons.append(single)
    # Return only after every pair has been processed
    return np.sum(all_cons)

>>> get_concentration(df, "A")
0.375
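
If the goal is simply to walk over label/count pairs, a Series can also be iterated directly with `.items()`, without the intermediate dictionary. The following is a sketch under that assumption; the function name `get_concentration_items` is made up for this example:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

def get_concentration_items(df, cat):
    counts = df[cat].value_counts()
    obs = len(df[cat])
    # Series.items() yields (label, count) pairs; only the count is needed here
    return sum((count / obs) ** 2 for _label, count in counts.items())

result = get_concentration_items(df, "A")
print(result)
```

Note that `len(df[cat])` counts missing values as observations, as the loop version above does, whereas `value_counts` skips them by default.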