Question

My dataset looks like this:

ID   |    country
1    |    USA
2    |    USA
3    |    Zimbabwe
4    |    Germany

I do the following to get the name of the most frequent country and its corresponding count. So in my case:

df.groupby(['country']).country.value_counts().nlargest(5).index[0]
df.groupby(['country']).country.value_counts().nlargest(5)[0]
df.groupby(['country']).country.value_counts().nlargest(5).index[1]
df.groupby(['country']).country.value_counts().nlargest(5)[1]
etc.

and the output would be:

(USA), 388
(DEU), 245
etc.

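I repeat this until I have the top 5 countries in the dataset.

But how can I get an "Other" entry that merges all the remaining countries together? For example, the following countries are not very common in my dataset:

Zimbabwe, Iraq, Malaysia, Kenya, Australia, etc.

So I would like a sixth value whose output looks like this:

(Other), 3728

How can I achieve this in pandas?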

Answer 1

Use:

N = 5
# count the occurrences of each country (sorted, most frequent first)
s = df.country.value_counts()
# select the top N values
out = s.iloc[:N]
# add the sum of all remaining values under the label 'Other'
out.loc['Other'] = s.iloc[N:].sum()
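
To see these steps on concrete data, here is a minimal self-contained sketch. The sample rows and N = 2 are made up for illustration, and the .copy() call is an addition that keeps the sliced Series independent of s before a new label is assigned.

```python
import pandas as pd

# Hypothetical sample data, not the asker's real dataset
df = pd.DataFrame({
    "ID": range(1, 9),
    "country": ["USA", "USA", "USA", "Germany", "Germany",
                "Zimbabwe", "Iraq", "Kenya"],
})

N = 2  # number of top countries to keep (the question uses 5)

# Counts per country, sorted from most to least frequent
s = df["country"].value_counts()

# Keep the N most frequent countries; copy so that s is left untouched
out = s.iloc[:N].copy()

# Lump every remaining country into a single 'Other' entry
out.loc["Other"] = s.iloc[N:].sum()

print(out)
```

With this sample, USA and Germany keep their own counts, and the three countries that appear once each are summed into 'Other'.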

Finally, if you need a two-column DataFrame:

df = out.reset_index()
df.columns=['country','count']
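
A possible alternative for the conversion step, sketched here with a made-up stand-in for out: naming the index and the values explicitly avoids relying on the default column names that reset_index produces, which differ between pandas versions. The variable name result is an assumption chosen so that the original df is not overwritten.

```python
import pandas as pd

# Stand-in for the `out` Series from the previous step (made-up counts)
out = pd.Series({"USA": 3, "Germany": 2, "Other": 3})

# Name the index 'country' and the values 'count' in one chain
result = out.rename_axis("country").reset_index(name="count")

print(result)
```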

Answer 2

Replace the less frequent countries with 'Other' before calling value_counts. One efficient way to do this is via Categorical Data. If you want to keep the original data, you can work on a copy, e.g.:

new_country_series = df['country'].copy()

Then extract the countries and their counts with value_counts.
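
The code for the categorical approach is not shown here, so the following is only one possible sketch of the idea and not necessarily what the answer's author wrote. The sample data, N = 2, and the helper names top and rare are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "country": ["USA", "USA", "USA", "Germany", "Germany",
                "Zimbabwe", "Iraq", "Kenya"],
})
N = 2  # number of top countries to keep

# astype returns a new Series, so the original column stays unchanged
new_country_series = df["country"].astype("category")

# Labels of the N most frequent countries
top = df["country"].value_counts().index[:N]

# Every category that is not in the top N will be merged
rare = [c for c in new_country_series.cat.categories if c not in top]

# 'Other' must be a registered category before it can be assigned
new_country_series = new_country_series.cat.add_categories("Other")
new_country_series[new_country_series.isin(rare)] = "Other"

# Drop the now-empty rare categories so they do not show up with count 0
new_country_series = new_country_series.cat.remove_unused_categories()

counts = new_country_series.value_counts()
print(counts)
```

Because the relabelling happens before counting, value_counts returns the top N countries and 'Other' directly, with no separate summing step.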
