Question

I load my dataset like this:

import pandas as pd
gdp = pd.read_csv(r"gdpproject.csv", encoding="ISO-8859-1")
gdp.head(2)
gdp.tail(2)

This gives me the output:

     Country.Name  Indicator.Name          2004          2005
0           World             GDP  5.590000e+13  5.810000e+13
1           World          Health  5.590000e+13  5.810000e+13
086      Zimbabwe  GDP per capita  8.681564e+02  8.082944e+02
089      Zimbabwe      Population  1.277751e+07  1.294003e+07

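As you can see right away, each country has several indicators.

What I want to do is create a new indicator from two of the existing indicators, and create it for each unique country.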

for i in series(gdp['Country.Name']):
    gdp['Military Spending'] = 100 / gdp['Military percent of GDP'] * gdp['GDP']

It gives me this error message:

NameError                                 Traceback (most recent call last)
<ipython-input-37-d817ea1522fc> in <module>()
----> 1 for i in series(gdp1['Country.Name']):
  2     gdp1['Military Spending'] = 100 / gdp1['Military percent of GDP'] * gdp1['GDP']

NameError: name 'series' is not defined

How can I get this loop over the series to work? I also tried simply

for i in gdp['Country.Name']

but I still get an error message.

Please help!

Answer 1

Suppose you have the following input DataFrame (note that the Military percent of GDP indicator does not appear in the data shown in your example):

>>> gdp
  Country.Name           Indicator.Name          2004          2005
0        World                      GDP  5.590000e+13  5.810000e+13
1        World  Military percent of GDP  2.100000e+00  2.300000e+00
2     Zimbabwe                      GDP  1.628900e+10  1.700000e+10
3     Zimbabwe  Military percent of GDP  2.000000e+00  2.100000e+00

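You can then create two auxiliary DataFrames, df_gdp and df_mpgdp, holding the 2004 and 2005 values of GDP and Military percent of GDP respectively. From these you can build df_msp, which contains the rows for a new Indicator.Name called Military Spending, and finally append it to the original DataFrame. Note that reset_index is needed in some places so that the calculation lines the rows up on the expected index.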
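The reset_index point can be illustrated with a minimal sketch. pandas aligns arithmetic on index labels, so two filtered Series whose labels do not overlap produce only NaN until both are renumbered. The toy values and the names df, gdp_rows, pct_rows, misaligned, and aligned are assumptions made up for this illustration.

```python
import pandas as pd

# Toy frame (illustrative values): rows 0 and 2 are GDP, rows 1 and 3 are percentages
df = pd.DataFrame({
    "Indicator.Name": ["GDP", "Military percent of GDP",
                       "GDP", "Military percent of GDP"],
    "2004": [100.0, 2.0, 200.0, 4.0],
})

# Filtering keeps the original index labels: 0, 2 for GDP and 1, 3 for the percentages
gdp_rows = df.loc[df["Indicator.Name"] == "GDP", "2004"]
pct_rows = df.loc[df["Indicator.Name"] == "Military percent of GDP", "2004"]

# The labels do not overlap, so every element of the result is NaN
misaligned = gdp_rows * pct_rows

# After reset_index(drop=True) both Series are labelled 0, 1 and line up by position
aligned = gdp_rows.reset_index(drop=True) * pct_rows.reset_index(drop=True)
```

Positional alignment like this only pairs the right rows if both filtered frames list the countries in the same order.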

The following code should do what you are after:

import pandas as pd

# Sample data: one row per (country, indicator) pair, one column per year
gdp = pd.DataFrame(
    [
        ["World", "GDP", 5.590000e+13, 5.810000e+13],
        ["World", "Military percent of GDP", 2.1, 2.3],
        ["Zimbabwe", "GDP", 16289e6, 17000e6],
        ["Zimbabwe", "Military percent of GDP", 2, 2.1],
    ],
    columns=["Country.Name", "Indicator.Name", "2004", "2005"],
)

# Helper frames, one per indicator; reset the index so rows align by position
df_gdp = gdp[gdp["Indicator.Name"] == "GDP"].reset_index(drop=True)
df_mpgdp = gdp[gdp["Indicator.Name"] == "Military percent of GDP"].reset_index(drop=True)

# Rows for the new indicator. The formula is kept exactly as in the question;
# for a percentage share you would normally use GDP * percent / 100 instead.
df_msp = pd.DataFrame()
df_msp["Country.Name"] = df_gdp["Country.Name"]
df_msp["Indicator.Name"] = "Military Spending"
df_msp["2004"] = 100 / df_mpgdp["2004"] * df_gdp["2004"]
df_msp["2005"] = 100 / df_mpgdp["2005"] * df_gdp["2005"]

# DataFrame.append was removed in pandas 2.0, so use pd.concat
gdp_out = pd.concat([gdp, df_msp])
gdp_out = gdp_out.sort_values(["Country.Name", "Indicator.Name"])
gdp_out = gdp_out.reset_index(drop=True)

The resulting output DataFrame is:

>>> gdp_out
  Country.Name           Indicator.Name          2004          2005
0        World                      GDP  5.590000e+13  5.810000e+13
1        World        Military Spending  2.661905e+15  2.526087e+15
2        World  Military percent of GDP  2.100000e+00  2.300000e+00
3     Zimbabwe                      GDP  1.628900e+10  1.700000e+10
4     Zimbabwe        Military Spending  8.144500e+11  8.095238e+11
5     Zimbabwe  Military percent of GDP  2.000000e+00  2.100000e+00
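As an alternative, the same result can be sketched without positional alignment by pivoting the indicators into columns, so that each (country, year) row holds both values side by side. This is only a sketch, not the method used above: it assumes each country has exactly one row per indicator, it reuses the sample data and the formula from the question, and the names tidy, wide, and msp are made up for illustration.

```python
import pandas as pd

gdp = pd.DataFrame(
    [
        ["World", "GDP", 5.590000e+13, 5.810000e+13],
        ["World", "Military percent of GDP", 2.1, 2.3],
        ["Zimbabwe", "GDP", 16289e6, 17000e6],
        ["Zimbabwe", "Military percent of GDP", 2, 2.1],
    ],
    columns=["Country.Name", "Indicator.Name", "2004", "2005"],
)

# Long format: one row per (country, indicator, year)
tidy = gdp.melt(
    id_vars=["Country.Name", "Indicator.Name"],
    var_name="Year",
    value_name="Value",
)

# Wide format: one row per (country, year), one column per indicator
wide = tidy.pivot(
    index=["Country.Name", "Year"],
    columns="Indicator.Name",
    values="Value",
)

# Same formula as in the question, now aligned on country and year labels
wide["Military Spending"] = 100 / wide["Military percent of GDP"] * wide["GDP"]

# Back to the original layout: one row per country, one column per year
msp = (
    wide["Military Spending"]
    .unstack("Year")
    .rename_axis(columns=None)
    .reset_index()
    .assign(**{"Indicator.Name": "Military Spending"})
)

gdp_out = (
    pd.concat([gdp, msp])
    .sort_values(["Country.Name", "Indicator.Name"])
    .reset_index(drop=True)
)
```

Because the arithmetic is aligned on the (Country.Name, Year) labels, it does not depend on the order in which countries appear in the input.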

Creating a new variable using a series with duplicates
