Question

I'm following the University of Michigan MOOC on data science in Python with pandas, and I ran into a problem in one of the tests.

I have to use the groupby function to compute the sum, mean, size and standard deviation for 15 countries grouped by continent.

The problem is that sum(), std() and size() work fine, but mean() does not, and I can't figure out why.

I have already tried specifying the type with dtype=float, but that did not help.

Here is my code:

# --------- This part is ok, just describing so you can understand --------- #
Top15 = answer_one() # load top 15 countries with most scientific publications

# list of the continents for the top 15 countries
ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}

# estimate the population of each country
# as Energy Supply / Energy Supply per Capita
Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15 = Top15[['PopEst']]

Top15.reset_index(inplace = True)
Top15['Continent'] = None

# loop that adds the corresponding continent to each country
for country in Top15['Country']:
    index_country = ((Top15.loc[Top15['Country'] == country]) # seek country index
                           .index)
    Top15.iloc[index_country,2] = ContinentDict[country] # add continent to country


# ---------- This is the part where I am having problem ---------- #
# create the 'answer' DataFrame
answer = pd.DataFrame(index=['Asia', 'Australia', 
                             'Europe', 'North America', 
                             'South America'], 
                      columns=['size', 'sum', 'mean', 'std'], dtype=float)

grouped = Top15.groupby('Continent')      # group countries by continent

answer['size'] = grouped.size()
answer['sum'] = grouped['PopEst'].sum()
answer['mean'] = grouped['PopEst'].mean()
answer['std'] = grouped['PopEst'].std()

When I reach the line answer['mean'] = grouped['PopEst'].mean(), I get this error:

DataError: No numeric types to aggregate

I don't know where the problem comes from.

PopEst contains numeric values. For example, the estimated population of China is 1.36765e+09.

This is the DataFrame Top15, built from what answer_one() returns, that I have to work with:

    Country             PopEst      Continent  
0   Australia           2.3316e+07  Australia
1   Brazil              2.05915e+08 South America
2   Canada              3.52399e+07 North America
3   China               1.36765e+09 Asia
4   France              6.38373e+07 Europe
5   Germany             8.03697e+07 Europe
6   India               1.27673e+09 Asia
7   Iran                7.70756e+07 Asia
8   Italy               5.99083e+07 Europe
9   Japan               1.27409e+08 Asia
10  Russian Federation  1.435e+08   Europe
11  South Korea         4.98054e+07 Asia
12  Spain               4.64434e+07 Europe
13  United Kingdom      6.3871e+07  Europe
14  United States       3.17615e+08 North America

This is what Top15.to_dict() gives me:

{'Country': {0: 'Australia',
  1: 'Brazil',
  2: 'Canada',
  3: 'China',
  4: 'France',
  5: 'Germany',
  6: 'India',
  7: 'Iran',
  8: 'Italy',
  9: 'Japan',
  10: 'Russian Federation',
  11: 'South Korea',
  12: 'Spain',
  13: 'United Kingdom',
  14: 'United States'},
 'PopEst': {0: 23316017.316017315,
  1: 205915254.23728815,
  2: 35239864.86486486,
  3: 1367645161.2903225,
  4: 63837349.39759036,
  5: 80369696.96969697,
  6: 1276730769.2307692,
  7: 77075630.25210084,
  8: 59908256.880733944,
  9: 127409395.97315437,
  10: 143500000.0,
  11: 49805429.864253394,
  12: 46443396.2264151,
  13: 63870967.741935484,
  14: 317615384.61538464},
 'Continent': {0: 'Australia',
  1: 'South America',
  2: 'North America',
  3: 'Asia',
  4: 'Europe',
  5: 'Europe',
  6: 'Asia',
  7: 'Asia',
  8: 'Europe',
  9: 'Asia',
  10: 'Europe',
  11: 'Asia',
  12: 'Europe',
  13: 'Europe',
  14: 'North America'}}

Answer 1

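This is a pandas bug: even when the data is not numeric, pandas still performs the sum and prod calculations in groupby. I checked the source code, and the bug is at {{3}} in site-packages\pandas\core\groupby\groupby.py. It reads: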

                except Exception:
                    pass

If you print the swallowed exception, you may also see "No numeric types to aggregate".

As a workaround, you can convert the data to a numeric type with:

df['column'] = pd.to_numeric(df['column'])
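Applied to a case like the one in the question, the conversion might look like the sketch below. The small DataFrame is a hypothetical stand-in for Top15, with PopEst deliberately stored as object dtype to mimic the situation described; whether mean() actually raises on such a column depends on the pandas version, so the sketch only shows the conversion and the aggregation that follows it.

```python
import pandas as pd

# Hypothetical miniature of Top15: PopEst holds numbers but is stored as object dtype
df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe", "Europe"],
    "PopEst": pd.Series([10.0, 30.0, 4.0, 6.0], dtype=object),
})

# Inspect the dtype first: object here means pandas does not treat the column as numeric
print(df["PopEst"].dtype)

# Convert the column to a real numeric dtype
df["PopEst"] = pd.to_numeric(df["PopEst"])

# All four statistics can then be computed in one call
answer = df.groupby("Continent")["PopEst"].agg(["size", "sum", "mean", "std"])
print(answer)
```

Checking df.dtypes before grouping is usually the quickest way to confirm whether this is the cause.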

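Some posts may tell you to pass errors='coerce' to pd.to_numeric so that non-numeric elements are replaced with NaN and no error is raised. In many cases, however, a conversion failure means the data itself contains errors. We should fix the data rather than silence the error.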
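One way to follow that advice, sketched here with made-up values, is to use errors='coerce' only as a diagnostic: convert a copy, then list the entries that failed to parse so they can be corrected at the source. The series name and its contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw column containing one entry that is not a number
raw = pd.Series(["1.5", "2.0", "n/a", "3"], name="column")

# Coerce a copy: unparseable entries become NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to convert
bad = raw[converted.isna() & raw.notna()]
print(bad)
```

Once the offending rows are known, they can be repaired or dropped deliberately, and a plain pd.to_numeric call without errors='coerce' should then succeed.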

groupby.mean() does not work, while sum(), std() and size() all work
