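I have a problem with an old function that computes the concentration of a pandas categorical column. Something seems to have changed, so that it is no longer possible to subset the result of the .value_counts() method of a categorical Series.
Minimal non-working example: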
import pandas as pd
import numpy as np

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df,cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.keys():
        single = np.square(np.divide(float(counts[key]),float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)

get_concentration(df, "A")
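This raises a KeyError at counts["a"]. I'm pretty sure this worked in past versions of pandas, and the documentation doesn't seem to mention any change to the .value_counts() method.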
Answer 0 (score: 2)
Let's agree on the methodology:
>>> df.A.value_counts()
a 2
b 1
c 1
>>> obs = len(df['A'].astype('category'))
>>> obs
4
The concentration should then be as follows (per the Herfindahl index, the sum of squared shares):
>>> (2 / 4.) ** 2 + (1 / 4.) ** 2 + (1 / 4.) ** 2
0.375
This is equivalent to (pandas 0.17+):
>>> ((df.A.value_counts() / df.A.count()) ** 2).sum()
0.375
If you really want a function:
def concentration(df, col):
    return ((df[col].value_counts() / df[col].count()) ** 2).sum()
>>> concentration(df, 'A')
0.375
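A related way to write the same calculation is to let `value_counts` compute the shares itself through its `normalize=True` argument. This is only a sketch, not part of the answer above; the function name `concentration_normalized` is made up for illustration, and like `count()` it ignores missing values.

```python
import pandas as pd

def concentration_normalized(df, col):
    # value_counts(normalize=True) returns each value's share of the
    # non-missing observations, so the shares sum to 1.
    shares = df[col].value_counts(normalize=True)
    # Herfindahl index: sum of squared shares.
    return shares.pow(2).sum()

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
result = concentration_normalized(df, "A")
print(result)  # 0.375
```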
Answer 1 (score: 1)
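To fix your current function, you just need to look the values up by their index label, using .ix (or .loc on pandas 1.0 and later, where .ix no longer exists); see below. You would be better off with a vectorized function, though, and I have added one at the end.
df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df, cat):
    tmp = df[cat].astype('category')
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.index:
        # label-based lookup; counts.ix[key] on pandas < 1.0
        single = np.square(np.divide(float(counts.loc[key]), float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)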
which yields:
>>> get_concentration(df, "A")
0.375
You may want to try a vectorized version, which also doesn't necessarily need the category dtype, e.g.:
def get_concentration(df, cat):
    counts = df[cat].value_counts()
    # divide by the number of observations, not the number of distinct values
    return counts.div(len(df[cat])).pow(2).sum()
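One detail worth checking in any vectorized version is the denominator. The shares must be taken relative to the number of observations (`obs` in the question), which differs from the number of distinct values as soon as a value repeats. A small sketch on the sample data, with variable names chosen here purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
counts = df["A"].value_counts()

n_rows = len(df["A"])      # number of observations: 4
n_distinct = len(counts)   # number of distinct values: 3

# Shares relative to the number of observations sum to 1.
shares = counts.div(n_rows)
hhi = shares.pow(2).sum()
print(n_rows, n_distinct, hhi)  # 4 3 0.375
```

Dividing by `n_distinct` instead would only give the same result when every value occurs exactly once.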
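Answer 2 (score: 1)
Since you are iterating in a loop (rather than working in a vectorized way), you may as well iterate over the key-value pairs explicitly. It simplifies the syntax, IMHO: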
import pandas as pd
import numpy as np

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df,cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    # See change in following line - you're anyway iterating
    # over key-value pairs; why not do so explicitly?
    for k, v in counts.to_dict().items():
        single = np.square(np.divide(float(v),float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
>>> get_concentration(df, "A")
0.375
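The same idea also works without the intermediate dict, since a Series can be iterated as (label, value) pairs through `Series.items()`. A minimal sketch, assuming Python 3 division; the name `get_concentration_items` is hypothetical:

```python
import pandas as pd

def get_concentration_items(df, cat):
    counts = df[cat].astype("category").value_counts()
    obs = len(df[cat])
    # Series.items() yields (label, count) pairs, so no label lookup is needed.
    return sum((count / obs) ** 2 for _, count in counts.items())

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
result = get_concentration_items(df, "A")
print(result)  # 0.375
```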