我正在关注密歇根大学关于Python熊猫数据科学的MOOC,在测试中遇到了一些问题。
我必须使用groupby函数来计算按大陆分组的15个国家的总和,均值,大小和标准差。
问题在于sum(),std()和size()可以正常工作,但不是mean(),我也不知道为什么。
我已经尝试使用dtype=float
指定类型,但是我无法使用。
这是我的代码:
# --------- This part is ok, just describing so you can understand --------- #
Top15 = answer_one() # load top 15 countries with most scientific publications
# list of the continents for the top 15 countries
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
# estimation of the population for each countries
# by calculating the Energy Supply / Energy Supply per Capita
Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15 = Top15[['PopEst']]
Top15.reset_index(inplace = True)
Top15['Continent'] = None
# loop that add the coresponding continent to the country
for country in Top15['Country']:
index_country = ((Top15.loc[Top15['Country'] == country]) # seek country index
.index)
Top15.iloc[index_country,2] = ContinentDict[country] # add continent to country
# ---------- This is the part where I am having problem ---------- #
# create the 'answer' DataFrame
answer = pd.DataFrame(index=['Asia', 'Australia',
'Europe', 'North America',
'South America'],
columns=['size', 'sum', 'mean', 'std'], dtype=float)
grouped = Top15.groupby('Continent') # group countries by continent
answer['size'] = grouped.size()
answer['sum'] = grouped['PopEst'].sum()
answer['mean'] = grouped['PopEst'].mean()
answer['std'] = grouped['PopEst'].std()
我到达answer['mean'] = grouped['PopEst'].mean()
行,错误:
DataError:没有要聚合的数字类型
我不知道问题出在哪里。
PopEst包含数值。例如,中国的人口估计为1.36765e + 09人。
这是answer_one()
返回的DataFrame Top15 ,我必须处理:
Country PopEst Continent
0 Australia 2.3316e+07 Australia
1 Brazil 2.05915e+08 South America
2 Canada 3.52399e+07 North America
3 China 1.36765e+09 Asia
4 France 6.38373e+07 Europe
5 Germany 8.03697e+07 Europe
6 India 1.27673e+09 Asia
7 Iran 7.70756e+07 Asia
8 Italy 5.99083e+07 Europe
9 Japan 1.27409e+08 Asia
10 Russian Federation 1.435e+08 Europe
11 South Korea 4.98054e+07 Asia
12 Spain 4.64434e+07 Europe
13 United Kingdom 6.3871e+07 Europe
14 United States 3.17615e+08 North America
这是Top15.to_dict()
给我的回馈:
{'Country': {0: 'Australia',
1: 'Brazil',
2: 'Canada',
3: 'China',
4: 'France',
5: 'Germany',
6: 'India',
7: 'Iran',
8: 'Italy',
9: 'Japan',
10: 'Russian Federation',
11: 'South Korea',
12: 'Spain',
13: 'United Kingdom',
14: 'United States'},
'PopEst': {0: 23316017.316017315,
1: 205915254.23728815,
2: 35239864.86486486,
3: 1367645161.2903225,
4: 63837349.39759036,
5: 80369696.96969697,
6: 1276730769.2307692,
7: 77075630.25210084,
8: 59908256.880733944,
9: 127409395.97315437,
10: 143500000.0,
11: 49805429.864253394,
12: 46443396.2264151,
13: 63870967.741935484,
14: 317615384.61538464},
'Continent': {0: 'Australia',
1: 'South America',
2: 'North America',
3: 'Asia',
4: 'Europe',
5: 'Europe',
6: 'Asia',
7: 'Asia',
8: 'Europe',
9: 'Asia',
10: 'Europe',
11: 'Asia',
12: 'Europe',
13: 'Europe',
14: 'North America'}}
答案 0 :(得分:1)
这是Pandas
的错误,即使数据不是数字,Pandas仍会在groupby中进行求和和产品计算。我检查了源代码,该错误出现在site-packages\pandas\core\groupby\groupby.py
的{{3}}中。它写道:
except Exception:
pass
如果打印错误,您可能还会发现“没有要聚合的数字类型”。
作为一种解决方案,您可以使用以下方法将数据更改为数值形式:
df['column'] = pd.to_numeric(df['column'])
某些帖子可能会告诉您在errors='coerce'
内添加pd.to_numeric
,以便将非数字元素替换为na
,并且不会引发错误。但是,在许多情况下,这意味着数据中存在一些错误。我们需要修复数据而不是消除错误。