Question

我有一个dataFrame，其中包含每年特定国家/地区的人口，以及一个pandas系列，其中包含每年的世界人口。这是我正在使用的系列：

pop_tot = df3.groupby('Year')['population'].sum()
Year     
1990    4.575442e+09
1991    4.659075e+09
1992    4.699921e+09
1993    4.795129e+09
1994    4.862547e+09
1995    4.949902e+09
...     ...
2017    6.837429e+09

这是我正在使用的DataFrame

        Country      Year   HDI     population
0       Afghanistan 1990    NaN     1.22491e+07
1       Albania     1990    0.645   3.28654e+06
2       Algeria     1990    0.577   2.59124e+07
3       Andorra     1990    NaN     54509
4       Angola      1990    NaN     1.21714e+07
...     ...         ...     ...     ...
4096    Uzbekistan  2017    0.71    3.23872e+07 
4097    Vanuatu     2017    0.603   276244  
4098    Zambia      2017    0.588   1.70941e+07 
4099    Zimbabwe    2017    0.535   1.65299e+07

我想计算该国家的人口每年代表的世界人口比例，因此我按如下方式遍历Series和DataFrame：

j = 0
for i in range(len(df3)):
    if df3.iloc[i,1]==pop_tot.index[j]:
        df3['pop_tot']=pop_tot[j] #Sanity check
        df3['weighted']=df3['population']/pop_tot[j]
        *df3.iloc[i,2]
    else:
        j=j+1

但是，我得到的DataFrame不是预期的。我最终将所有值除以2017年的总人口，从而得出与那年不正确的比例（即，对于第一行，pop_tot应该为4.575442e + 09，因为根据该系列，它对应于1990年）以上，而不是对应于2017年的6.837429e + 09）。

     Country   Year HDI   population  pop_tot      weighted
  0  Albania   1990 0.645 3.28654e+06 6.837429e+09 0.000257158
  1  Algeria   1990 0.577 2.59124e+07 6.837429e+09 0.00202753
  2  Argentina 1990 0.704 3.27297e+07 6.837429e+09 0.00256096

但是我看不出循环中有什么错误。预先感谢。

Answer 1

您不需要循环，可以使用groupby.transform在pop_tot中直接创建列df3。然后对列weighted进行列操作，例如：

df3['pop_tot'] = df3.groupby('Year')['population'].transform(sum)
df3['weighted'] = df3['population']/df3['pop_tot']

正如@roganjosh指出的那样，您的方法存在的问题是，每当满足条件pop_tot时就替换整列weighted和if，因此在最后一次迭代中满足条件（可能是2017年），您将pop_tot列的值定义为2017年之一，并使用该值计算周长。

Answer 2

您不必循环，它速度较慢，并且可以使事情变得非常复杂。例如，使用pandas和numpys向量化解决方案：

df['pop_tot'] = df.population.sum()
df['weighted'] =  df.population / df.population.sum()

print(df)
       Country  Year    HDI  population     pop_tot  weighted
0  Afghanistan  1990    NaN  12249100.0  53673949.0  0.228213
1      Albania  1990  0.645   3286540.0  53673949.0  0.061232
2      Algeria  1990  0.577  25912400.0  53673949.0  0.482774
3      Andorra  1990    NaN     54509.0  53673949.0  0.001016
4       Angola  1990    NaN  12171400.0  53673949.0  0.226766

在OP评论后进行编辑

df['pop_tot'] = df.groupby('Year').population.transform('sum')

df['weighted'] =  df.population / df['pop_tot']

print(df)
       Country  Year    HDI  population     pop_tot  weighted
0  Afghanistan  1990    NaN  12249100.0  53673949.0  0.228213
1      Albania  1990  0.645   3286540.0  53673949.0  0.061232
2      Algeria  1990  0.577  25912400.0  53673949.0  0.482774
3      Andorra  1990    NaN     54509.0  53673949.0  0.001016
4       Angola  1990    NaN  12171400.0  53673949.0  0.226766

注释
我以您提供的小型数据集为例：

    Country     Year    HDI     population
0   Afghanistan 1990    NaN     12249100.0
1   Albania     1990    0.645   3286540.0
2   Algeria     1990    0.577   25912400.0
3   Andorra     1990    NaN     54509.0
4   Angola      1990    NaN     12171400.0

循环仅取最后一个值

2 个答案: