Pandas: Fill in missing dates using information from other rows

Time: 2020-03-26 18:32:13

Tags: python pandas date

Say I have the following pandas dataframe:

grp.....outcome
A.......1
A.......2
B.......NULL
B.......1

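I need to update the dataframe with the missing dates for Northern Territory, 2020-03-10 and 2020-03-11. However, I want to reuse all of the information except Cases and Deaths. Like this: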

Date    Region  Country Cases   Deaths  Lat Long
2020-03-08  Northern Territory  Australia   27  49  -12.4634    130.8456
2020-03-09  Northern Territory  Australia   80  85  -12.4634    130.8456
2020-03-12  Northern Territory  Australia   35  73  -12.4634    130.8456
2020-03-08  Western Australia   Australia   48  20  -31.9505    115.8605
2020-03-09  Western Australia   Australia   70  12  -31.9505    115.8605
2020-03-10  Western Australia   Australia   66  95  -31.9505    115.8605
2020-03-11  Western Australia   Australia   31  38  -31.9505    115.8605
2020-03-12  Western Australia   Australia   40  83  -31.9505    115.8605

The only way I can think of doing this is to iterate over every combination of date and country/region.

Edit

Erfan seems to be on the right track, but I can't get it to work. Here is the actual data I'm using, rather than a toy example.

Date    Region  Country Cases   Deaths  Lat Long
2020-03-08  Northern Territory  Australia   27  49  -12.4634    130.8456
2020-03-09  Northern Territory  Australia   80  85  -12.4634    130.8456
2020-03-10  Northern Territory  Australia   0   0   -12.4634    130.8456
2020-03-11  Northern Territory  Australia   0   0   -12.4634    130.8456
2020-03-12  Northern Territory  Australia   35  73  -12.4634    130.8456
2020-03-08  Western Australia   Australia   48  20  -31.9505    115.8605
2020-03-09  Western Australia   Australia   70  12  -31.9505    115.8605
2020-03-10  Western Australia   Australia   66  95  -31.9505    115.8605
2020-03-11  Western Australia   Australia   31  38  -31.9505    115.8605
2020-03-12  Western Australia   Australia   40  83  -31.9505    115.8605

You can see that it did not insert the missing dates at the daily frequency specified in the resample. I'm not sure what is wrong.

Edit 2

Here is my solution, based on Erfan's.

import pandas as pd

unique_group = ['province','country','county']
csbs_df = pd.read_csv(
        'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz', index_col=0)

csbs_df['Date'] = pd.to_datetime(csbs_df['Date'], infer_datetime_format=True)
new_df = (
    csbs_df.set_index('Date')
    .groupby(unique_group)
    .resample('D').first()
    .fillna(dict.fromkeys(['confirmed', 'deaths'], 0))
    .ffill()
    .reset_index(level=3)
    .reset_index(drop=True))
new_df.head()
Date    id  lat lon Timestamp   province    country_code    country county  confirmed   deaths  source  Date_text
0   2020-03-25  1094.0  32.534893   -86.642709  2020-03-25 00:00:00+00:00   Alabama US  US  Autauga 1.0 0.0 CSBS    03/25/20
1   2020-03-26  901.0   32.534893   -86.642709  2020-03-26 00:00:00+00:00   Alabama US  US  Autauga 4.0 0.0 CSBS    03/26/20
2   2020-03-24  991.0   30.735891   -87.723525  2020-03-24 00:00:00+00:00   Alabama US  US  Baldwin 3.0 0.0 CSBS    03/24/20
3   2020-03-25  1080.0  30.735891   -87.723525  2020-03-25 00:00:00+00:00   Alabama US  US  Baldwin 4.0 0.0 CSBS    03/25/20
4   2020-03-26  1139.0  30.735891   -87.723525  2020-03-26 16:52:00+00:00   Alabama US  US  Baldwin 4.0 0.0 CSBS    03/26/20

1 answer:

Answer 0 (score: 3)

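Use GroupBy.resample, ffill, and fillna

The idea here is that we want to "fill in" the missing date gaps for each group of Region and Country. This is called resampling of a time series.

That is why we use GroupBy.resample here instead of DataFrame.resample. We also need fillna and ffill to fill the data according to your logic.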
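
To make the difference concrete, here is a minimal sketch on the toy data from the question. The frame is my own transcription, reduced to three columns, and only one column is selected after the groupby to keep the example small.

```python
import pandas as pd

# Toy data transcribed from the question (assumption: only three columns kept).
# Northern Territory has no rows for 2020-03-10 and 2020-03-11.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-03-08", "2020-03-09", "2020-03-12",
        "2020-03-08", "2020-03-09", "2020-03-10", "2020-03-11", "2020-03-12",
    ]),
    "Region": ["Northern Territory"] * 3 + ["Western Australia"] * 5,
    "Cases": [27, 80, 35, 48, 70, 66, 31, 40],
})

# DataFrame.resample treats the whole frame as one timeline:
# one row per calendar day, with both regions collapsed together.
flat = df.set_index("Date").resample("D").first()

# GroupBy.resample builds a daily timeline per region,
# so the gap in Northern Territory shows up as NaN rows.
per_region = df.set_index("Date").groupby("Region")["Cases"].resample("D").first()

print(len(flat), len(per_region))
print(per_region[per_region.isna()])
```

The flat version never notices the gap, because Western Australia already covers every day. The per-region version exposes the two missing days, which is what the fillna and ffill steps then act on.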

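# Parse the dates so they can be used as a DatetimeIndex
# (infer_datetime_format is deprecated in pandas 2.x and is not needed here).
df['Date'] = pd.to_datetime(df['Date'])
dfn = (
    df.set_index('Date')
    .groupby(['Region', 'Country'])
    .resample('D').first()                          # one row per day per group; gaps become NaN
    .fillna(dict.fromkeys(['Cases', 'Deaths'], 0))  # inserted days get 0 cases and deaths
    .ffill()                                        # remaining columns are copied from the previous row
    .reset_index(level=2)                           # move Date back into a column
    .reset_index(drop=True)
)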

        Date              Region    Country  Cases  Deaths      Lat      Long
0 2020-03-08  Northern Territory  Australia   27.0    49.0 -12.4634  130.8456
1 2020-03-09  Northern Territory  Australia   80.0    85.0 -12.4634  130.8456
2 2020-03-10  Northern Territory  Australia    0.0     0.0 -12.4634  130.8456
3 2020-03-11  Northern Territory  Australia    0.0     0.0 -12.4634  130.8456
4 2020-03-12  Northern Territory  Australia   35.0    73.0 -12.4634  130.8456
5 2020-03-08   Western Australia  Australia   48.0    20.0 -31.9505  115.8605
6 2020-03-09   Western Australia  Australia   70.0    12.0 -31.9505  115.8605
7 2020-03-10   Western Australia  Australia   66.0    95.0 -31.9505  115.8605
8 2020-03-11   Western Australia  Australia   31.0    38.0 -31.9505  115.8605
9 2020-03-12   Western Australia  Australia   40.0    83.0 -31.9505  115.8605
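
For anyone who wants to reproduce the table above end to end, here is a self-contained sketch on the toy data. Note that it swaps in a per-group `asfreq('D')` with an explicit loop instead of `GroupBy.resample`, because, as far as I know, the way `GroupBy.resample` handles the grouping columns differs between pandas versions. The helper name `fill_gaps` is hypothetical.

```python
import pandas as pd

# Toy data transcribed from the question.
rows = [
    ("2020-03-08", "Northern Territory", "Australia", 27, 49, -12.4634, 130.8456),
    ("2020-03-09", "Northern Territory", "Australia", 80, 85, -12.4634, 130.8456),
    ("2020-03-12", "Northern Territory", "Australia", 35, 73, -12.4634, 130.8456),
    ("2020-03-08", "Western Australia", "Australia", 48, 20, -31.9505, 115.8605),
    ("2020-03-09", "Western Australia", "Australia", 70, 12, -31.9505, 115.8605),
    ("2020-03-10", "Western Australia", "Australia", 66, 95, -31.9505, 115.8605),
    ("2020-03-11", "Western Australia", "Australia", 31, 38, -31.9505, 115.8605),
    ("2020-03-12", "Western Australia", "Australia", 40, 83, -31.9505, 115.8605),
]
df = pd.DataFrame(
    rows, columns=["Date", "Region", "Country", "Cases", "Deaths", "Lat", "Long"]
)
df["Date"] = pd.to_datetime(df["Date"])


def fill_gaps(group):
    """Hypothetical helper: give one group a gap-free daily index."""
    out = group.set_index("Date").asfreq("D")  # missing days become NaN rows
    out[["Cases", "Deaths"]] = out[["Cases", "Deaths"]].fillna(0)  # counts default to 0
    out = out.ffill()  # copy Region, Country, Lat, Long from the previous day
    return out.rename_axis("Date").reset_index()


dfn = pd.concat(
    [fill_gaps(g) for _, g in df.groupby(["Region", "Country"])],
    ignore_index=True,
)
print(dfn)
```

The explicit loop is slower than a vectorised groupby on large data, but it keeps each step easy to inspect.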

Edit:

It seems that not all places share the same start and end dates, so we have to account for that. The following should work:

csbs_df = pd.read_csv(
        'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz'
).iloc[:, 1:]

csbs_df['Date_text'] = pd.to_datetime(csbs_df['Date_text'])
date_range = pd.date_range(csbs_df['Date_text'].min(), csbs_df['Date_text'].max(), freq='D')

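def reindex_dates(data, dates):
    # Zero-fill the count columns on inserted dates (this dataset names them
    # 'confirmed' and 'deaths'), then copy the other columns forward and backward.
    data = data.reindex(dates).fillna(dict.fromkeys(['confirmed', 'deaths'], 0)).ffill().bfill()
    return data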

dfn = (
    csbs_df.set_index('Date_text')
    .groupby('id').apply(lambda x: reindex_dates(x, date_range))
    .reset_index(level=0, drop=True)
    .reset_index()
    .rename(columns={'index': 'Date'})
)

print(dfn.head())


        Date   id        lat        lon                  Timestamp  \
0 2020-03-24  0.0  40.714550 -74.007140  2020-03-24 00:00:00+00:00   
1 2020-03-25  0.0  40.714550 -74.007140  2020-03-25 00:00:00+00:00   
2 2020-03-26  0.0  40.714550 -74.007140  2020-03-26 00:00:00+00:00   
3 2020-03-24  1.0  41.163198 -73.756063  2020-03-24 00:00:00+00:00   
4 2020-03-25  1.0  41.163198 -73.756063  2020-03-25 00:00:00+00:00   

         Date  province country_code country       county  confirmed  deaths  \
0  2020-03-24  New York           US      US     New York    13119.0   125.0   
1  2020-03-25  New York           US      US     New York    15597.0   192.0   
2  2020-03-26  New York           US      US     New York    20011.0   280.0   
3  2020-03-24  New York           US      US  Westchester     2894.0     0.0   
4  2020-03-25  New York           US      US  Westchester     3891.0     1.0   

  source  
0   CSBS  
1   CSBS  
2   CSBS  
3   CSBS  
4   CSBS
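
The effect of reindexing every group against one shared date range is easier to see on a tiny frame. The sketch below uses a hand-picked subset of the rows printed above, keeps only a few columns, and assumes grouping by `county` is enough to identify a place. It also uses an explicit loop rather than `groupby(...).apply`, which is my own substitution.

```python
import pandas as pd

# Hand-picked subset of the output above (assumption: a few columns only).
# Westchester has a single day, so it starts later and ends earlier than New York.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-24", "2020-03-25", "2020-03-26", "2020-03-25"]),
    "county": ["New York", "New York", "New York", "Westchester"],
    "lat": [40.714550, 40.714550, 40.714550, 41.163198],
    "confirmed": [13119, 15597, 20011, 3891],
    "deaths": [125, 192, 280, 1],
})

# One shared daily range, from the earliest to the latest date in the data.
full_range = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")


def reindex_dates(group, dates):
    out = group.set_index("Date").reindex(dates)  # add rows for absent days
    out[["confirmed", "deaths"]] = out[["confirmed", "deaths"]].fillna(0)
    # ffill covers days after the last record, bfill covers days before the first.
    out = out.ffill().bfill()
    return out.rename_axis("Date").reset_index()


filled = pd.concat(
    [reindex_dates(g, full_range) for _, g in df.groupby("county")],
    ignore_index=True,
)
print(filled)
```

Every county ends up with the same three dates. The `bfill` is what keeps `county` and `lat` populated on days before a place first appears. Whether 0 is the right value for counts on those days depends on how the counts are defined, so treat that part as an assumption.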