Creating a new variable using a series with duplicates

Time: 2017-10-14 18:27:59

Tags: python pandas duplicates time-series

I load my dataset like this:

import pandas as pd

gdp = pd.read_csv(r"gdpproject.csv", encoding="ISO-8859-1")
gdp.head(2)
gdp.tail(2)

This gives me the output:

     Country.Name  Indicator.Name  2004          2005
0    World         GDP             5.590000e+13  5.810000e+13
1    World         Health          5.590000e+13  5.810000e+13
086  Zimbabwe      GDP per capita  8.681564e+02  8.082944e+02
089  Zimbabwe      Population      1.277751e+07  1.294003e+07

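As you can see right away, each country has several indicators.

What I want to do is create a new indicator from two of the existing ones, and create it for every unique country.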

for i in series(gdp['Country.Name']):
    gdp['Military Spending'] = 100 / gdp['Military percent of GDP'] * gdp['GDP']

It gives me this error message:

NameError                                 Traceback (most recent call last)
<ipython-input-37-d817ea1522fc> in <module>()
----> 1 for i in series(gdp1['Country.Name']):
      2     gdp1['Military Spending'] = 100 / gdp1['Military percent of GDP'] * gdp1['GDP']

NameError: name 'series' is not defined

How do I get this series to work? I also tried simply

for i in gdp['Country.Name'] 

but I still get an error message.

Please help!

1 Answer:

Answer 0: (score: 0)

Assume you have the following input DataFrame (note that the Military percent of GDP data is not present in your example):

>>> gdp
  Country.Name           Indicator.Name          2004          2005
0        World                      GDP  5.590000e+13  5.810000e+13
1        World  Military percent of GDP  2.100000e+00  2.300000e+00
2     Zimbabwe                      GDP  1.628900e+10  1.700000e+10
3     Zimbabwe  Military percent of GDP  2.000000e+00  2.100000e+00

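You can then build two helper DataFrames, df_gdp and df_mpgdp, holding the 2004 and 2005 values of GDP and Military percent of GDP respectively. From these you can build df_msp, which holds a new Indicator.Name called Military Spending, and finally append it to the original DataFrame. Note that reset_index is needed in some places so that the calculation is carried out on the expected index.

The following code should do what you want: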
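To see why reset_index matters here, a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not from your data): pandas aligns arithmetic on index labels, and the two filtered selections keep their original, non-overlapping row labels.

```python
import pandas as pd

# Hypothetical toy data in the same layout as the example (illustrative values).
gdp = pd.DataFrame(
    [
        ["World", "GDP", 100.0],
        ["World", "Military percent of GDP", 2.0],
        ["Zimbabwe", "GDP", 50.0],
        ["Zimbabwe", "Military percent of GDP", 3.0],
    ],
    columns=["Country.Name", "Indicator.Name", "2004"],
)

# Filtering keeps the original row labels: 0 and 2 for GDP, 1 and 3 for the percentage.
gdp_rows = gdp.loc[gdp["Indicator.Name"] == "GDP", "2004"]
pct_rows = gdp.loc[gdp["Indicator.Name"] == "Military percent of GDP", "2004"]

# The labels do not overlap, so every aligned pair has one side missing.
misaligned = pct_rows * gdp_rows
all_nan = bool(misaligned.isna().all())

# After reset_index(drop=True) both Series are labelled 0 and 1, so rows pair up by position.
aligned = pct_rows.reset_index(drop=True) * gdp_rows.reset_index(drop=True)
aligned_values = aligned.tolist()
```

Without the reset, the product contains only NaN. With it, rows are paired by position, which is only correct if both selections list the countries in the same order.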

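import pandas as pd

# Sample input; the question's data did not include "Military percent of GDP".
gdp = pd.DataFrame([
    ["World", "GDP", 5.590000e+13, 5.810000e+13],
    ["World", "Military percent of GDP", 2.1, 2.3],
    ["Zimbabwe", "GDP", 16289e6, 17000e6],
    ["Zimbabwe", "Military percent of GDP", 2, 2.1]])
gdp.columns = ["Country.Name", "Indicator.Name", "2004", "2005"]

# Helper frames: one row per country for each indicator.
df_gdp = gdp[gdp["Indicator.Name"] == "GDP"]
df_mpgdp = gdp[gdp["Indicator.Name"] == "Military percent of GDP"]

# Build the new indicator. reset_index(drop=True) gives both operands the same
# 0..n-1 index, so rows are paired by position (this assumes both helper frames
# list the countries in the same order).
# The formula is the one from the question; if the indicator is a percentage
# of GDP, spending would instead be percent / 100 * GDP.
df_msp = pd.DataFrame()
df_msp["Country.Name"] = df_gdp["Country.Name"].reset_index(drop=True)
df_msp["Indicator.Name"] = "Military Spending"
df_msp["2004"] = 100 / df_mpgdp["2004"].reset_index(drop=True) * df_gdp["2004"].reset_index(drop=True)
df_msp["2005"] = 100 / df_mpgdp["2005"].reset_index(drop=True) * df_gdp["2005"].reset_index(drop=True)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
gdp_out = pd.concat([gdp, df_msp])
gdp_out = gdp_out.sort_values(["Country.Name", "Indicator.Name"])
gdp_out = gdp_out.reset_index(drop=True)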

The resulting output DataFrame is:

>>> gdp_out
  Country.Name           Indicator.Name          2004          2005
0        World                      GDP  5.590000e+13  5.810000e+13
1        World        Military Spending  2.661905e+15  2.526087e+15
2        World  Military percent of GDP  2.100000e+00  2.300000e+00
3     Zimbabwe                      GDP  1.628900e+10  1.700000e+10
4     Zimbabwe        Military Spending  8.144500e+11  8.095238e+11
5     Zimbabwe  Military percent of GDP  2.000000e+00  2.100000e+00
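As for the NameError itself: `series` is not a defined name (the pandas class is `pd.Series`), and no loop over countries is needed, since pandas arithmetic already applies to every row. As a possible alternative, here is a sketch that indexes the frame by country and indicator so the two indicators align by country label instead of by position. It reuses the sample data above and the formula exactly as written in the question; the names `indexed`, `gdp_block`, `pct_block` and `spending` are just illustrative.

```python
import pandas as pd

gdp = pd.DataFrame(
    [
        ["World", "GDP", 5.59e13, 5.81e13],
        ["World", "Military percent of GDP", 2.1, 2.3],
        ["Zimbabwe", "GDP", 16289e6, 17000e6],
        ["Zimbabwe", "Military percent of GDP", 2.0, 2.1],
    ],
    columns=["Country.Name", "Indicator.Name", "2004", "2005"],
)

# Index by (country, indicator) so one indicator can be pulled out as a
# country-indexed block with xs().
indexed = gdp.set_index(["Country.Name", "Indicator.Name"])
gdp_block = indexed.xs("GDP", level="Indicator.Name")
pct_block = indexed.xs("Military percent of GDP", level="Indicator.Name")

# Same formula as in the question. Both blocks are indexed by country, so the
# arithmetic lines up per country and per year without reset_index.
spending = 100 / pct_block * gdp_block

# Turn the country index back into a column and label the new indicator.
spending = spending.reset_index()
spending.insert(1, "Indicator.Name", "Military Spending")

gdp_out = (
    pd.concat([gdp, spending], ignore_index=True)
    .sort_values(["Country.Name", "Indicator.Name"])
    .reset_index(drop=True)
)
```

Because the alignment is by country name, this variant does not depend on the rows for each indicator appearing in the same country order. A country missing one of the two indicators would simply get NaN for Military Spending.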