Pandas: *efficiently* copy valid values to other rows by group

Time: 2019-02-06 19:35:28

Tags: python pandas dataframe nan

  

"Premature optimization is the root of all evil (but it's nice once you have an ugly solution)" D. Knuth

Given this dataset:

from io import StringIO
import pandas as pd

csv = StringIO("""country,year,surface,ground,tot_water,enviro,depend
Yemen,2012,2,1.5,2.1,0.55,0
Yemen,2013,,,,,
Yemen,2014,2,1.5,2.1,,0
Yemen,2015,,,,,
Yemen,2016,,,,,
Yemen,2017,,,,0.55,
Zambia,1995,,,,,
Zambia,1996,,,,,
Zambia,1997,104.8,47,104.8,31.48,23.47""")

df = pd.read_csv(csv)
df

Out[0]:
  country  year  surface  ground  tot_water  enviro  depend
0   Yemen  2012      2.0     1.5        2.1    0.55    0.00
1   Yemen  2013      NaN     NaN        NaN     NaN     NaN
2   Yemen  2014      2.0     1.5        2.1     NaN    0.00
3   Yemen  2015      NaN     NaN        NaN     NaN     NaN
4   Yemen  2016      NaN     NaN        NaN     NaN     NaN
5   Yemen  2017      NaN     NaN        NaN    0.55     NaN
6  Zambia  1995      NaN     NaN        NaN     NaN     NaN
7  Zambia  1996      NaN     NaN        NaN     NaN     NaN
8  Zambia  1997    104.8    47.0      104.8   31.48   23.47

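I want to take the valid values in the ['surface', 'ground', 'tot_water', 'enviro'] columns and copy them to every row of the same country. I have a working solution, but it could use some optimization.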

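Pseudocode:

  1. Iterate over each country and filter the dataframe down to that subset
  2. Find the index of the first valid value in the column
  3. Set every value of that column in the subset equal to that valid value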

My solution:

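vars_ = ['surface', 'ground', 'tot_water', 'enviro']
# for each country
for country in df.country.unique():
    # boolean mask selecting this country's rows
    filter_ = df.country == country
    # and for each column to fill
    for var in vars_:
        # index label of the first non-NaN value within this country
        valid_ix = df.loc[filter_, var].first_valid_index()
        # overwrite the column for this country with that value
        df.loc[filter_, var] = df.loc[valid_ix, var]

df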

Out[]:
  country  year  surface  ground  tot_water  enviro  depend
0   Yemen  2012      2.0     1.5        2.1    0.55    0.00
1   Yemen  2013      2.0     1.5        2.1    0.55     NaN
2   Yemen  2014      2.0     1.5        2.1    0.55    0.00
3   Yemen  2015      2.0     1.5        2.1    0.55     NaN
4   Yemen  2016      2.0     1.5        2.1    0.55     NaN
5   Yemen  2017      2.0     1.5        2.1    0.55     NaN
6  Zambia  1995    104.8    47.0      104.8   31.48     NaN
7  Zambia  1996    104.8    47.0      104.8   31.48     NaN
8  Zambia  1997    104.8    47.0      104.8   31.48   23.47

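There must be a more efficient way. On a moderately sized dataset this takes quite a long time, and the for loops are ugly. Any suggestions/help would be greatly appreciated!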

1 Answer:

Answer 0: (score: 1)

You can group the data by country and use bfill and ffill:

df.groupby('country').bfill().ffill()

  country  year  surface  ground  tot_water  enviro  depend
0   Yemen  2012      2.0     1.5        2.1    0.55    0.00
1   Yemen  2013      2.0     1.5        2.1    0.55    0.00
2   Yemen  2014      2.0     1.5        2.1    0.55    0.00
3   Yemen  2015      2.0     1.5        2.1    0.55    0.00
4   Yemen  2016      2.0     1.5        2.1    0.55    0.00
5   Yemen  2017      2.0     1.5        2.1    0.55    0.00
6  Zambia  1995    104.8    47.0      104.8   31.48   23.47
7  Zambia  1996    104.8    47.0      104.8   31.48   23.47
8  Zambia  1997    104.8    47.0      104.8   31.48   23.47
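
Note that the output above also fills `depend`, which the question left untouched, and that only the `bfill()` step is scoped to each country; the trailing `ffill()` runs over the whole frame, so a country whose column is entirely NaN could pick up a value from the country before it. A minimal sketch of two group-scoped variants restricted to the question's columns follows. The names `filled` and `filled_b` are illustrative and not from the original post. Option A mirrors the question's loop (every row gets the group's first valid value); Option B is assumed to be what the answer intended (fill gaps only, within each group).

```python
from io import StringIO
import pandas as pd

csv = StringIO("""country,year,surface,ground,tot_water,enviro,depend
Yemen,2012,2,1.5,2.1,0.55,0
Yemen,2013,,,,,
Yemen,2014,2,1.5,2.1,,0
Yemen,2015,,,,,
Yemen,2016,,,,,
Yemen,2017,,,,0.55,
Zambia,1995,,,,,
Zambia,1996,,,,,
Zambia,1997,104.8,47,104.8,31.48,23.47""")
df = pd.read_csv(csv)

vars_ = ['surface', 'ground', 'tot_water', 'enviro']

# Option A: broadcast each country's first non-null value to all of its rows.
# GroupBy 'first' skips NaN, so this matches first_valid_index() in the loop.
filled = df.copy()
filled[vars_] = filled.groupby('country')[vars_].transform('first')

# Option B: fill gaps backward, then forward, without crossing country boundaries.
# Unlike Option A, existing non-null values are left as they are.
filled_b = df.copy()
filled_b[vars_] = (
    filled_b.groupby('country')[vars_]
    .transform(lambda g: g.bfill().ffill())
)

print(filled)
```

On this dataset both options give the same result for the four columns, because each country has only one distinct valid value per column. They differ when a country has several distinct values in a column: Option A overwrites them all with the first, while Option B only fills the NaN gaps.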