pandas - get a sequence of column values and put them into a cell

Time: 2017-04-17 17:23:38

Tags: pandas

I have a DataFrame that is multi-indexed by country and date, like this:

[Image: the input DataFrame, multi-indexed by country and date]

What I need is a table grouped by country with one extra column that holds the sequence of measurements for each date, like this:

[Image: the desired table, one row per country with a sequence column]

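Note that the sequence contains zeros for dates that exist for some countries but are missing for others.

The only way I have managed to do this is by iterating over each country: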

import pandas as pd
from datetime import datetime

# discard dates and create a DataFrame grouped by country
grouped_chunk = grouped_tds_df.groupby('country__name').sum()

# create an index containing an uninterrupted sequence of dates
full_date_range = pd.date_range(datetime(2017, 4, 13).date(), datetime(2017, 4, 18).date())

# iterate over each country
for country_idx in grouped_chunk.index:

    # get rows that contain data for this country    
    this_country_raws = grouped_tds_df[grouped_tds_df['country__name'] == country_idx]

    # reindex them to include missing dates
    this_country_raws = this_country_raws.set_index('date').reindex(
        full_date_range, fill_value=0
    )

    # pick a list of values for sequence
    joined = ','.join(str(l) for l in this_country_raws['raws'])

    # insert sequence into original table
    grouped_chunk.loc[country_idx, 'raws_sequence'] = joined

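But is there a way to do this without iterating? Perhaps some kind of bulk reindex on the second index level (date)? I would rather not run a separate reindex for every country.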

1 Answer:

Answer 0: (score: 1)

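Use unstack together with stack to fill in the missing elements of the Cartesian product of countries and dates. unstack(fill_value=0) pivots the dates into columns and puts 0 wherever a country has no row for a date; stack then restores the long shape.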

# Named aggregation (pandas >= 0.25). The nested-dict renaming form
# originally used here was removed in pandas 1.0.
d1 = df.unstack(fill_value=0).stack().groupby(level='country_name').agg(
    raws=('raws', 'sum'),
    raws_sequence=('raws', lambda x: list(x)),
    cost=('cost', 'sum'),
)

d1

               raws                    raws_sequence    cost
country_name                                                
AD               18              [12, 1, 4, 1, 0, 0]  0.0018
AE            11444   [0, 0, 3336, 2619, 2520, 2969]  1.1444
AF            31120  [0, 0, 12701, 9602, 6979, 1838]  3.1120
AG             5306   [0, 0, 1161, 1514, 1065, 1566]  0.5306
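
The bulk reindex that the question asks about can also be sketched directly, as an alternative to the unstack/stack round trip. This is only a minimal sketch on a made-up two-country frame (the names small, full_index, filled and sequences and all values are illustrative, not from the original post): it builds the full country-by-date index with pd.MultiIndex.from_product and reindexes once, filling the gaps with 0.

```python
import pandas as pd

# Hypothetical miniature of the frame: two countries covering different dates.
idx = pd.MultiIndex.from_tuples(
    [("AD", "2017-04-10"), ("AD", "2017-04-11"),
     ("AE", "2017-04-11"), ("AE", "2017-04-12")],
    names=["country_name", "date"],
)
small = pd.DataFrame({"raws": [12, 1, 3336, 2619]}, index=idx)

# Every (country, date) pair that could exist, built from the index levels.
# Assumption: the levels hold no unused values (true for a freshly built index).
full_index = pd.MultiIndex.from_product(small.index.levels, names=small.index.names)

# A single reindex for all countries; pairs that were absent get 0.
filled = small.reindex(full_index, fill_value=0)

# Collect the now equally long sequences per country.
sequences = filled.groupby(level="country_name")["raws"].agg(list)
print(sequences)
```

Whether this is preferable to unstack/stack probably depends on taste; both make the missing pairs explicit before grouping.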

Setup

import pandas as pd

mux = pd.MultiIndex.from_tuples([
        ['AD', '2017-04-10'],
        ['AD', '2017-04-11'],
        ['AD', '2017-04-12'],
        ['AD', '2017-04-13'],
        ['AE', '2017-04-12'],
        ['AE', '2017-04-13'],
        ['AE', '2017-04-14'],
        ['AE', '2017-04-15'],
        ['AF', '2017-04-12'],
        ['AF', '2017-04-13'],
        ['AF', '2017-04-14'],
        ['AF', '2017-04-15'],
        ['AG', '2017-04-12'],
        ['AG', '2017-04-13'],
        ['AG', '2017-04-14'],
        ['AG', '2017-04-15'],
    ], names=['country_name', 'date'])
df = pd.DataFrame([
        [12],
        [1],
        [4],
        [1],
        [3336],
        [2619],
        [2520],
        [2969],
        [12701],
        [9602],
        [6979],
        [1838],
        [1161],
        [1514],
        [1065],
        [1566]
    ], mux, ['raws']).eval('cost = raws / 10000', inplace=False)
df

                          raws    cost
country_name date                     
AD           2017-04-10     12  0.0012
             2017-04-11      1  0.0001
             2017-04-12      4  0.0004
             2017-04-13      1  0.0001
AE           2017-04-12   3336  0.3336
             2017-04-13   2619  0.2619
             2017-04-14   2520  0.2520
             2017-04-15   2969  0.2969
AF           2017-04-12  12701  1.2701
             2017-04-13   9602  0.9602
             2017-04-14   6979  0.6979
             2017-04-15   1838  0.1838
AG           2017-04-12   1161  0.1161
             2017-04-13   1514  0.1514
             2017-04-14   1065  0.1065
             2017-04-15   1566  0.1566
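
If the sequence is wanted as a comma-joined string, as in the question's loop, one possible variant pivots the raws column alone and joins each row. Again this is a minimal sketch on a made-up two-country frame (small, wide and result are illustrative names, not part of the original answer):

```python
import pandas as pd

# Hypothetical miniature of the frame: two countries covering different dates.
idx = pd.MultiIndex.from_tuples(
    [("AD", "2017-04-10"), ("AD", "2017-04-11"),
     ("AE", "2017-04-11"), ("AE", "2017-04-12")],
    names=["country_name", "date"],
)
small = pd.DataFrame({"raws": [12, 1, 3336, 2619]}, index=idx)

# One row per country, one column per date; absent pairs become 0.
wide = small["raws"].unstack(fill_value=0)

result = pd.DataFrame({
    # Total per country across all dates.
    "raws": wide.sum(axis=1),
    # Each row of the wide table is already the full date sequence.
    "raws_sequence": wide.astype(str).apply(",".join, axis=1),
})
print(result)
```

Because the wide table already has one column per date, no stack step is needed in this variant.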