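I have a DataFrame with a MultiIndex on country and date, as shown below:
What I need is a table grouped by country with an extra column holding the sequence of measurements for each date, like this:
Note that the sequence contains zeros for dates that are present for some countries but missing for others.
The only way I have managed to do this is by iterating over each country: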
from datetime import datetime
import pandas as pd

# discard dates and create dataframe grouped by countries
grouped_chunk = grouped_tds_df.groupby('country__name').sum()
# create index containing uninterrupted sequence of dates
full_date_range = pd.date_range(datetime(2017, 4, 13).date(), datetime(2017, 4, 18).date())
# iterate over each country
for country_idx in grouped_chunk.index:
    # get rows that contain data for this country
    this_country_raws = grouped_tds_df[grouped_tds_df['country__name'] == country_idx]
    # reindex them to include missing dates
    this_country_raws = this_country_raws.set_index('date').reindex(
        full_date_range, fill_value=0
    )
    # pick a list of values for sequence
    joined = ','.join(str(l) for l in this_country_raws['raws'])
    # insert sequence into original table
    grouped_chunk.loc[country_idx, 'raws_sequence'] = joined
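But is there a way to do this without iterating? Perhaps some kind of bulk reindexing on level 2 (date) of the original table's index? I would rather not fire off a reindex for every country.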
Answer 0 (score: 1)
Use unstack together with stack to fill in the missing elements of the Cartesian product of countries and dates.
# Named aggregation (pandas >= 0.25); the nested-dict renaming form of
# agg used in older answers was removed in pandas 1.0.
d1 = df.unstack(fill_value=0).stack().groupby(level='country_name').agg(
    raws=('raws', 'sum'),
    raws_sequence=('raws', list),
    cost=('cost', 'sum'),
)
d1
               raws                     raws_sequence    cost
country_name
AD               18               [12, 1, 4, 1, 0, 0]  0.0018
AE            11444    [0, 0, 3336, 2619, 2520, 2969]  1.1444
AF            31120   [0, 0, 12701, 9602, 6979, 1838]  3.1120
AG             5306    [0, 0, 1161, 1514, 1065, 1566]  0.5306
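The result above holds lists and only covers dates that occur for at least one country, whereas the loop in the question built a comma-separated string over an uninterrupted date range. A minimal sketch of that shape, assuming a tiny made-up frame (the values and the names wide, full_dates and out are illustrative, not from the original post):

```python
import pandas as pd

# Tiny illustrative frame: two countries with partly non-overlapping dates.
idx = pd.MultiIndex.from_tuples(
    [("AD", "2017-04-10"), ("AD", "2017-04-11"),
     ("AE", "2017-04-11"), ("AE", "2017-04-13")],
    names=["country_name", "date"],
)
df = pd.DataFrame({"raws": [12, 1, 3336, 2619]}, index=idx)

# Pivot dates into columns; missing (country, date) pairs become 0.
wide = df["raws"].unstack(fill_value=0)

# Add dates that no country has at all (2017-04-12 here) by reindexing
# the columns against an uninterrupted daily range.
wide.columns = pd.to_datetime(wide.columns)
full_dates = pd.date_range("2017-04-10", "2017-04-13")
wide = wide.reindex(columns=full_dates, fill_value=0)

# One row per country: the total and the comma-joined sequence.
out = pd.DataFrame({
    "raws": wide.sum(axis=1),
    "raws_sequence": wide.astype(str).apply(",".join, axis=1),
})
print(out)
```

Working on the wide table keeps everything to one unstack and one reindex, regardless of how many countries there are.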
Setup
import pandas as pd

mux = pd.MultiIndex.from_tuples([
    ['AD', '2017-04-10'],
    ['AD', '2017-04-11'],
    ['AD', '2017-04-12'],
    ['AD', '2017-04-13'],
    ['AE', '2017-04-12'],
    ['AE', '2017-04-13'],
    ['AE', '2017-04-14'],
    ['AE', '2017-04-15'],
    ['AF', '2017-04-12'],
    ['AF', '2017-04-13'],
    ['AF', '2017-04-14'],
    ['AF', '2017-04-15'],
    ['AG', '2017-04-12'],
    ['AG', '2017-04-13'],
    ['AG', '2017-04-14'],
    ['AG', '2017-04-15'],
], names=['country_name', 'date'])
df = pd.DataFrame([
    [12],
    [1],
    [4],
    [1],
    [3336],
    [2619],
    [2520],
    [2969],
    [12701],
    [9602],
    [6979],
    [1838],
    [1161],
    [1514],
    [1065],
    [1566]
], mux, ['raws']).eval('cost = raws / 10000', inplace=False)
df
                          raws    cost
country_name date
AD           2017-04-10     12  0.0012
             2017-04-11      1  0.0001
             2017-04-12      4  0.0004
             2017-04-13      1  0.0001
AE           2017-04-12   3336  0.3336
             2017-04-13   2619  0.2619
             2017-04-14   2520  0.2520
             2017-04-15   2969  0.2969
AF           2017-04-12  12701  1.2701
             2017-04-13   9602  0.9602
             2017-04-14   6979  0.6979
             2017-04-15   1838  0.1838
AG           2017-04-12   1161  0.1161
             2017-04-13   1514  0.1514
             2017-04-14   1065  0.1065
             2017-04-15   1566  0.1566
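The bulk reindexing that the question hints at is also possible in a single call, by reindexing against the full product of the two index levels. The following is only a sketch under assumptions: it uses a small made-up frame rather than the data above, and the names small, full_index, filled and sequences are illustrative.

```python
import pandas as pd

# Small illustrative frame; AD lacks 2017-04-12 and AE lacks 2017-04-10.
idx = pd.MultiIndex.from_tuples(
    [("AD", "2017-04-10"), ("AD", "2017-04-11"),
     ("AE", "2017-04-11"), ("AE", "2017-04-12")],
    names=["country_name", "date"],
)
small = pd.DataFrame({"raws": [12, 1, 3336, 2619]}, index=idx)

# Every (country, date) pair that can be formed from the two index levels.
full_index = pd.MultiIndex.from_product(
    [small.index.levels[0], small.index.levels[1]],
    names=small.index.names,
)

# One reindex for the whole table; pairs that were missing become 0.
filled = small.reindex(full_index, fill_value=0)

# Collect each country's values, in date order, into a list.
sequences = filled.groupby(level="country_name")["raws"].agg(list)
print(sequences)
```

This should give the same filled table as unstack(fill_value=0).stack() on such data, while making the target index explicit, which helps if the second level needs to be replaced by a fixed date range.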