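I have yearly information (COUNT) for countries stored in a DataFrame. However, some countries are missing in certain years.
Given that I have a complete list of countries, what is the best way to add them under the corresponding years and fill the missing COUNT values with 0?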
     DATE         COUNTRY  COUNTRY_ID  COUNT
0    1980   United States         840     42
42   1980  Czech Republic         203      2
95   1980         Hungary         348      1
96   1980   Great Britain         826      1
97   1980    South Africa         710      1
98   1982   United States         840     42
140  1982        Paraguay         600      2
.
.
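To make the problem reproducible, here is a minimal sketch that rebuilds only the seven rows visible above (the elided rows are not reconstructed) and lists which (DATE, COUNTRY) pairs have no row. The names `all_pairs`, `present`, and `missing` are introduced here for illustration only.

```python
import pandas as pd

# Only the rows shown in the question; the elided ones are left out.
df = pd.DataFrame(
    {
        "DATE": [1980, 1980, 1980, 1980, 1980, 1982, 1982],
        "COUNTRY": ["United States", "Czech Republic", "Hungary",
                    "Great Britain", "South Africa", "United States", "Paraguay"],
        "COUNTRY_ID": [840, 203, 348, 826, 710, 840, 600],
        "COUNT": [42, 2, 1, 1, 1, 42, 2],
    },
    index=[0, 42, 95, 96, 97, 98, 140],
)

# Every combination of an observed year and an observed country.
all_pairs = pd.MultiIndex.from_product(
    [sorted(df["DATE"].unique()), df["COUNTRY"].unique()],
    names=["DATE", "COUNTRY"],
)

# Combinations that already have a row in df.
present = pd.MultiIndex.from_frame(df[["DATE", "COUNTRY"]])

# Combinations that still need a zero-count row.
missing = all_pairs.difference(present)
print(missing.tolist())
```

This only considers years that appear in the data, so 1981 is not reported here; whether gap years should be added as well is a separate decision.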
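Answer 0 (score: 1)
One way to do this is to build every DATE/COUNTRY combination, reindex the
DataFrame against those combinations, and finally fill in the missing values.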
import pandas as pd

# Assume that we want all years not just the ones seen
years = range(df['DATE'].min(), df['DATE'].max()+1)
# get all combinations
idx = pd.MultiIndex.from_product([years, df['COUNTRY'].unique()], names=['DATE', 'COUNTRY'])
# reindex by first putting DATE and COUNTRY into the index
df1 = df.set_index(['DATE', 'COUNTRY']).reindex(idx).reset_index()
# Fill back in missing IDs
country_map = df.set_index('COUNTRY')['COUNTRY_ID'].drop_duplicates()
df1['COUNTRY_ID'] = df1.COUNTRY.map(country_map)
# fill in 0 for COUNT and convert back to int
df1['COUNT'] = df1['COUNT'].fillna(0).astype(int)
    DATE         COUNTRY  COUNTRY_ID  COUNT
0   1980   United States         840     42
1   1980  Czech Republic         203      2
2   1980         Hungary         348      1
3   1980   Great Britain         826      1
4   1980    South Africa         710      1
5   1980        Paraguay         600      0
6   1981   United States         840      0
7   1981  Czech Republic         203      0
8   1981         Hungary         348      0
9   1981   Great Britain         826      0
10  1981    South Africa         710      0
11  1981        Paraguay         600      0
12  1982   United States         840     42
13  1982  Czech Republic         203      0
14  1982         Hungary         348      0
15  1982   Great Britain         826      0
16  1982    South Africa         710      0
17  1982        Paraguay         600      2
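The question mentions having a complete list of countries, which may include countries that never appear in `df`. Below is a sketch of the same reindex idea driven by such a list. The function name `fill_missing` and the `countries` lookup frame (one row per country, with COUNTRY and COUNTRY_ID columns) are assumptions made for illustration, and the function expects each (DATE, COUNTRY) pair to occur at most once in `df`.

```python
import pandas as pd

def fill_missing(df, countries, years=None):
    """Return one row per (year, country); COUNT is 0 where df has no row.

    `countries` is a hypothetical lookup frame with unique COUNTRY values
    and their COUNTRY_ID.
    """
    if years is None:
        # Default to every year between the first and last one seen.
        years = range(int(df["DATE"].min()), int(df["DATE"].max()) + 1)
    idx = pd.MultiIndex.from_product(
        [list(years), list(countries["COUNTRY"])], names=["DATE", "COUNTRY"]
    )
    # Reindex only the counts; absent combinations get 0 directly.
    counts = df.set_index(["DATE", "COUNTRY"])["COUNT"]
    out = counts.reindex(idx, fill_value=0).reset_index()
    # Take IDs from the complete list so unseen countries get one too.
    out = out.merge(countries, on="COUNTRY", how="left")
    return out[["DATE", "COUNTRY", "COUNTRY_ID", "COUNT"]]

# Small made-up example: Hungary is in the list but not in the data.
df = pd.DataFrame({
    "DATE": [1980, 1982],
    "COUNTRY": ["United States", "Paraguay"],
    "COUNTRY_ID": [840, 600],
    "COUNT": [42, 2],
})
countries = pd.DataFrame({
    "COUNTRY": ["United States", "Paraguay", "Hungary"],
    "COUNTRY_ID": [840, 600, 348],
})
result = fill_missing(df, countries)
print(result)
```

Passing `fill_value=0` to `reindex` keeps COUNT as an integer column, so the later `fillna`/`astype` step is not needed in this variant.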
Answer 1 (score: 0)
Also consider the cross-join merge
route (for those of us with a SQL mindset):
# ASSIGN KEY COLUMN
df['KEY'] = 1
# CREATE DF OF DATES RANGE
dates = pd.DataFrame({'DATE': list(range(df['DATE'].min(), df['DATE'].max() + 1)),
                      'COUNT': 0, 'KEY': 1})
# CROSS JOIN MERGE
mdf = df.merge(dates, on=['KEY'])
# REASSIGN COUNT
mdf.loc[mdf['DATE_x'] != mdf['DATE_y'], 'COUNT_x'] = 0
# CLEAN UP DF (COLS AND ROWS)
# SORT BY COUNT FIRST SO drop_duplicates KEEPS THE REAL ROW RATHER THAN A
# ZEROED COPY WHEN A COUNTRY APPEARS IN MORE THAN ONE YEAR
mdf = mdf[['DATE_y', 'COUNTRY', 'COUNTRY_ID', 'COUNT_x']]\
    .rename(columns={'DATE_y':'DATE', 'COUNT_x':'COUNT'})\
    .sort_values('COUNT', ascending=False)\
    .drop_duplicates(['DATE', 'COUNTRY', 'COUNTRY_ID'])\
    .sort_values(['DATE', 'COUNTRY'])\
    .reset_index(drop=True)
#     DATE         COUNTRY  COUNTRY_ID  COUNT
# 0   1980  Czech Republic         203      2
# 1   1980   Great Britain         826      1
# 2   1980         Hungary         348      1
# 3   1980        Paraguay         600      0
# 4   1980    South Africa         710      1
# 5   1980   United States         840     42
# 6   1981  Czech Republic         203      0
# 7   1981   Great Britain         826      0
# 8   1981         Hungary         348      0
# 9   1981        Paraguay         600      0
# 10  1981    South Africa         710      0
# 11  1981   United States         840      0
# 12  1982  Czech Republic         203      0
# 13  1982   Great Britain         826      0
# 14  1982         Hungary         348      0
# 15  1982        Paraguay         600      2
# 16  1982    South Africa         710      0
# 17  1982   United States         840     42
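On pandas 1.2 or later, `merge` accepts `how='cross'`, so the helper KEY column can be dropped. The sketch below is a variation on the same SQL-style idea, not the code above: it cross joins the year range with the distinct countries, then left-joins the observed counts onto that grid. The names `years`, `countries`, `grid`, and `full` are illustrative, and the sample frame contains only the rows shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": [1980, 1980, 1980, 1980, 1980, 1982, 1982],
    "COUNTRY": ["United States", "Czech Republic", "Hungary",
                "Great Britain", "South Africa", "United States", "Paraguay"],
    "COUNTRY_ID": [840, 203, 348, 826, 710, 840, 600],
    "COUNT": [42, 2, 1, 1, 1, 42, 2],
})

# One row per year in the observed range, including gap years such as 1981.
years = pd.DataFrame(
    {"DATE": list(range(int(df["DATE"].min()), int(df["DATE"].max()) + 1))}
)

# One row per distinct country with its ID.
countries = df[["COUNTRY", "COUNTRY_ID"]].drop_duplicates()

# Cartesian product of years and countries (requires pandas >= 1.2).
grid = years.merge(countries, how="cross")

# Attach the observed counts; combinations without a match get NaN, then 0.
full = grid.merge(df, on=["DATE", "COUNTRY", "COUNTRY_ID"], how="left")
full["COUNT"] = full["COUNT"].fillna(0).astype(int)
print(full)
```

Because the counts are joined on rather than overwritten, there is nothing to deduplicate afterwards, provided each (DATE, COUNTRY) pair appears only once in `df`.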