Question

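I have a problem with an old function that computes the concentration of a pandas categorical column. Something seems to have changed, so that the result of the .value_counts() method on a categorical Series can no longer be subset.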

Minimal non-working example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df,cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.keys():
        single = np.square(np.divide(float(counts[key]),float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)

get_concentration(df, "A")

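This raises a KeyError for counts["a"]. I am fairly sure this worked in past versions of pandas, and the documentation does not seem to mention any change to the .value_counts() method.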

Answer 1

Let's agree on the methodology first:

>>> df.A.value_counts()
a    2
b    1
c    1

>>> obs = len(df['A'].astype('category'))
>>> obs
4

The concentration should then be as follows (per the Herfindahl index):

>>> (2 / 4.) ** 2 + (1 / 4.) ** 2 + (1 / 4.) ** 2
0.375
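
For reference, the Herfindahl index is the sum of the squared shares of each distinct value. The following is a minimal sketch in plain Python, without pandas, that reproduces the arithmetic above; the function name herfindahl_plain is hypothetical and is not part of the question or of any library.

```python
from collections import Counter

def herfindahl_plain(values):
    """Sum of squared shares of each distinct value (hypothetical helper)."""
    counts = Counter(values)          # e.g. {"a": 2, "b": 1, "c": 1}
    total = sum(counts.values())      # number of observations
    return sum((n / total) ** 2 for n in counts.values())

# Same data as column "A" in the question
print(herfindahl_plain(["a", "b", "c", "a"]))  # 0.375
```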

This is equivalent to (pandas 0.17+):

>>> ((df.A.value_counts() / df.A.count()) ** 2).sum()
0.375

If you really want a function:

def concentration(df, col):
    return ((df[col].value_counts() / df[col].count()) ** 2).sum()

>>> concentration(df, 'A')
0.375
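
A possible variation, offered only as a sketch: value_counts accepts normalize=True, which returns the shares directly, so the explicit division can be dropped. The name concentration_normalized is made up for this example.

```python
import pandas as pd

def concentration_normalized(df, col):
    # normalize=True makes value_counts return relative frequencies
    # (shares of the non-missing observations) instead of raw counts
    shares = df[col].value_counts(normalize=True)
    return (shares ** 2).sum()

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
result = concentration_normalized(df, "A")
print(result)  # 0.375
```

Like the count()-based version, this ignores missing values, because value_counts drops NaN by default.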

Answer 2

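To fix your current function, you just need to look the values up by label through the index, using .loc (see below; older pandas versions spelled this .ix, which has since been removed). You would be better off with a vectorized function, though; I have added one at the end.

import numpy as np
import pandas as pd

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df, cat):
    tmp = df[cat].astype('category')
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.index:
        single = np.square(np.divide(float(counts.loc[key]), float(obs)))
        all_cons.append(single)
    # Return only after every category has been accumulated
    return np.sum(all_cons)

This yields:

get_concentration(df, "A")

0.375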

You may want to try a vectorized version, which also does not necessarily need the category dtype, for example:

def get_concentration(df, cat):
    counts = df[cat].value_counts()
    # Divide by the number of observations, not the number of categories
    return counts.div(counts.sum()).pow(2).sum()
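
As a rough check of the claim that the category dtype is not required, the sketch below applies the same vectorized calculation to a plain object column and to a categorical column that declares an extra, unused category. The helper name herfindahl and the extra category "d" are assumptions made for this illustration.

```python
import pandas as pd

def herfindahl(series):
    # Hypothetical helper: squared shares summed over all distinct values
    counts = series.value_counts()
    return counts.div(counts.sum()).pow(2).sum()

plain = pd.Series(["a", "b", "c", "a"])

# Categorical version with an unused category "d"; value_counts reports it
# with a count of 0, which contributes nothing to the sum
categorical = plain.astype(pd.CategoricalDtype(["a", "b", "c", "d"]))

plain_result = herfindahl(plain)
categorical_result = herfindahl(categorical)
print(plain_result, categorical_result)  # 0.375 0.375
```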

Answer 3

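Since you are iterating in a loop anyway (rather than working in a vectorized way), you might as well iterate over the key-value pairs explicitly. It simplifies the syntax, IMHO: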

import pandas as pd
import numpy as np

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df,cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    # See change in following line - you're anyway iterating 
    #    over key-value pairs; why not do so explicitly?
    for k, v in counts.to_dict().items():
        single = np.square(np.divide(float(v),float(obs)))
        all_cons.append(single)
    # Return only after the loop has accumulated every category
    return np.sum(all_cons)

>>> get_concentration(df, "A")
0.375
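
A related sketch, under the assumption that a reasonably recent pandas is in use: a Series can be iterated as (label, value) pairs directly with .items(), so the intermediate to_dict() call is optional. The function name concentration_terms is hypothetical; it returns the per-category squared shares so that each contribution can be inspected before summing.

```python
import pandas as pd

def concentration_terms(df, col):
    # Hypothetical helper: squared share of each category, keyed by label
    counts = df[col].value_counts()
    obs = counts.sum()
    return {label: (count / obs) ** 2 for label, count in counts.items()}

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
terms = concentration_terms(df, "A")
total = sum(terms.values())
print(total)  # 0.375
```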

Computing the "concentration" of a pandas categorical
