Question

我正在使用具有以下形状的数据集：

tconst  GreaterEuropean British WestEuropean    Italian French  Jewish  Germanic    Nordic  Asian   GreaterEastAsian    Japanese    Hispanic    GreaterAfrican  Africans    EastAsian   Muslim  IndianSubContinent  total_ethnicities
0   tt0000001   3   1   2   0   1   0   0   1   0   0   0   0   0   0   0   0   0   8
1   tt0000002   2   0   2   0   2   0   0   0   0   0   0   0   0   0   0   0   0   6
2   tt0000003   4   0   3   0   3   1   0   0   0   0   0   0   0   0   0   0   0   11
3   tt0000004   2   0   2   0   2   0   0   0   0   0   0   0   0   0   0   0   0   6
4   tt0000005   3   2   1   0   0   0   1   0   0   0   0   0   0   0   0   0   0   7

这是IMDB数据，经过处理后，我创建了这些列，表示电影中有很多民族演员（tcons）。

我想创建另一个列df["diversity"]，即：

( diversity score "gini index")

例如：对于每部电影，我们说有10个演员; 3亚洲人，3英国人，3非洲裔美国人和1法国人。所以我们除以总数 3/10 3/10 3/10 1/10 然后1减去（3/10）平方（3/10）平方（3/10）平方（1/10）平方的总和将每个角色的得分添加到列中作为多样性。

我正在尝试简单的pandas操作，但没有到达那里。

编辑：

第一行，我们的种族总数为8

3 GreaterEuropean
1 British
2 WestEuropean
1 French
1 nordic

所以得分

1- [（3/8）^ 2 +（1/8）^ 2 +（2/8）^ 2 +（1/8）^ 2 +（1/8）^ 2]

Answer 1

你可以在这里使用numpy矢量化，即

one = df.drop(['total_ethnicities'],1).values
# Select the values other than total_ethnicities
two = df['total_ethnicities'].values[:,None]
# Select the values of total_ethnicities
df['diversity'] = 1 - pd.np.sum((one/two)**2, axis=1)
# Divide the values of one by two, square them. Sum over the axis. Then subtract from 1. 
df['diversity']

tconst
tt0000001    0.750000
tt0000002    0.666667
tt0000003    0.710744
tt0000004    0.666667
tt0000005    0.693878
Name: diversity, dtype: float64

Answer 2

df2 = df.set_index('tconst')
total = df2.pop('total_ethnicities')
result = 1 - ((df2** 2 ).div(total**2, axis=0)).sum(axis=1)
result.name = 'gini'

tconst
tt0000001    0.750000
tt0000002    0.666667
tt0000003    0.710744
tt0000004    0.666667
tt0000005    0.693878
Name: gini, dtype: float64

除此之外，我总是尝试将原始数据与我的解析数据分开，因此我会将列total_etnicities保留在一个单独的系列中，并且只有在需要报告结果时我才能将它们

如果您真的希望将此结果作为df中的额外列，则可以通过以下方式执行此操作：

df = df.join(result, on='tconst')

Answer 3

执行此操作的最佳方法是将所有列与给定列进行比较，因为基尼系数定义了分布的差异。您将生成比较分布的基尼系数，例如意大利语，法语，犹太语。然后，与给定的列进行比较，您甚至可以将这些种族分组为相似分布的群集。

假设df2是您的数据框。基尼指数公式为：

您在Pandas中选择了枢轴列（place_y）：

place_y=df2.columns.get_loc("price_doc")

gini=[]
for i in range(0,df2.shape[1]):
    gini.append((df2.shape[0]+1-2*(np.sum((df2.shape[0]+1-df2.ix[:,i])*df2.ix[:,place_y])/np.sum(df2.ix[:,place_y])))/df2.shape[0])

然后你选择最符合你的阈值的列，让我们假设0.2，最相似的分布：

np.where(np.array(np.abs(gini))<.2)[0]

在您的情况下，似乎您要比较示例（行）而不是功能（列）以生成新列。这是相同的理性，转换。在您的枢轴行中，基尼系数将为零，而其他所有系数都将具有系数。

Python - Pandas数据操作来计算Gini系数

3 个答案: