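I am trying to generate the missing rows in a DataFrame based on the year (an integer). I tried a few solutions I found on Stack Overflow, but none of them worked. They return the error: ValueError: cannot reindex from a duplicate axis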
Add missing dates to pandas dataframe
Missing data, insert rows in Pandas and fill with NAN
import pandas as pd

data = {'id': [100, 100, 100, 100, 100, 200, 200],
        'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
        'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]}
df = pd.DataFrame(data)
df
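For context, here is a minimal sketch of where this error usually comes from. It assumes the linked solutions were applied by setting year alone as the index, which the question does not state explicitly. Because the same year appears under more than one id, the index contains duplicate labels and reindex refuses to work on it. The exact wording of the message differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

by_year = df.set_index('year')

# 2010 appears once for id 100 and once for id 200
has_duplicates = by_year.index.has_duplicates

try:
    by_year.reindex(range(2010, 2017))
    error = None
except ValueError as exc:
    # reindex cannot decide which of the duplicate rows to keep
    error = exc

print(has_duplicates)
print(type(error).__name__)
```

Including id in the index, so that each (id, year) pair is unique, avoids the problem.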
What I am looking for:
Answer 0 (score: 4)
The idea is to create a MultiIndex with DataFrame.set_index, then call DataFrame.reindex with a second MultiIndex, built by MultiIndex.from_product, that contains every (id, year) combination:
import numpy as np
import pandas as pd

data = {'id': [100, 100, 100, 100, 100, 200, 200],
        'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
        'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]}
df = pd.DataFrame(data)

# you can specify the minimum and maximum year with a scalar
mux = pd.MultiIndex.from_product([df['id'].unique(),
                                  np.arange(df['year'].min(), 2019)],
                                 names=['id', 'year'])
df1 = df.set_index(['id', 'year']).reindex(mux, fill_value=0).reset_index()
print(df1)
id year value
0 100 2010 3000
1 100 2011 0
2 100 2012 0
3 100 2013 1000
4 100 2014 2000
5 100 2015 1200
6 100 2016 1300
7 100 2017 0
8 100 2018 0
9 200 2010 2000
10 200 2011 0
11 200 2012 1500
12 200 2013 0
13 200 2014 0
14 200 2015 0
15 200 2016 0
16 200 2017 0
17 200 2018 0
# you can also take the minimum and maximum year from the data with min and max
mux = pd.MultiIndex.from_product([df['id'].unique(),
                                  np.arange(df['year'].min(), df['year'].max() + 1)],
                                 names=['id', 'year'])
df2 = df.set_index(['id', 'year']).reindex(mux, fill_value=0).reset_index()
print(df2)
id year value
0 100 2010 3000
1 100 2011 0
2 100 2012 0
3 100 2013 1000
4 100 2014 2000
5 100 2015 1200
6 100 2016 1300
7 200 2010 2000
8 200 2011 0
9 200 2012 1500
10 200 2013 0
11 200 2014 0
12 200 2015 0
13 200 2016 0
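The two variants above differ only in how the year range is chosen, so they can be folded into one helper. This is a sketch, not part of the original answer: the function name fill_missing_years and its parameters first_year, last_year and fill_value are hypothetical, and it assumes the columns are named id, year and value as in the question.

```python
import numpy as np
import pandas as pd


def fill_missing_years(df, first_year=None, last_year=None, fill_value=0):
    """Return df with one row per (id, year) pair in the inclusive year range.

    Hypothetical helper: the bounds default to the min and max year in df.
    """
    first_year = df['year'].min() if first_year is None else first_year
    last_year = df['year'].max() if last_year is None else last_year
    mux = pd.MultiIndex.from_product(
        [df['id'].unique(), np.arange(first_year, last_year + 1)],
        names=['id', 'year'])
    return (df.set_index(['id', 'year'])
              .reindex(mux, fill_value=fill_value)
              .reset_index())


df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

filled = fill_missing_years(df)                    # 2010..2016, like df2
extended = fill_missing_years(df, last_year=2018)  # 2010..2018, like df1
print(filled)
```

This only works when each (id, year) pair occurs at most once in the input. If a pair is repeated, reindex raises the same duplicate-axis error as in the question.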
Answer 1 (score: 3)
IIUC, you want to fill in the data for each id across the full year range:
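# get the min and max year range
# change max_year if needed
min_year, max_year = df.year.min(), df.year.max()

# frame with every year, used for the merge
new_df = pd.DataFrame({'year': np.arange(min_year, max_year + 1)})

# groupby: outer-merge each id's rows onto the full year range
# (errors='ignore' because some pandas versions do not pass the grouping column to apply)
new_df = df.groupby('id').apply(lambda x: new_df.merge(x, how='outer')
                                                .drop(columns='id', errors='ignore')
                                ).reset_index()
# assign the result back instead of using inplace=True on a column
new_df['value'] = new_df['value'].fillna(0)

# if there are other columns, this forward-fills them within each id
new_df.groupby('id').ffill()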
Output:
id level_1 year value
0 100 0 2010 3000.0
1 100 1 2011 0.0
2 100 2 2012 0.0
3 100 3 2013 1000.0
4 100 4 2014 2000.0
5 100 5 2015 1200.0
6 100 6 2016 1300.0
7 200 0 2010 2000.0
8 200 1 2011 0.0
9 200 2 2012 1500.0
10 200 3 2013 0.0
11 200 4 2014 0.0
12 200 5 2015 0.0
13 200 6 2016 0.0
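The same merge idea can be sketched without groupby.apply, which also avoids the extra level_1 column. This is an alternative to the answer's code, not the answer's own method: it builds every (id, year) pair with a cross join and then left-merges the original data onto it. It assumes pandas 1.2 or later, where merge accepts how='cross'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

# one row per year in the inclusive range, and one row per distinct id
years = pd.DataFrame({'year': np.arange(df['year'].min(), df['year'].max() + 1)})
ids = df[['id']].drop_duplicates()

# every (id, year) combination
grid = ids.merge(years, how='cross')

# attach the known values; pairs with no match get NaN, then 0
result = grid.merge(df, on=['id', 'year'], how='left')
result['value'] = result['value'].fillna(0)
print(result)
```

As in the answer's output, value ends up as float because NaN is introduced before the fill.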
Answer 2 (score: 3)
Another approach, using groupby and reindex():
l=[g.set_index('year').reindex(range(2010,2019)) for i,g in df.groupby('id')]
final=pd.concat(l).reset_index()
final=final.assign(**{'id':final.id.ffill(),'value':final.value.fillna(0)})
print(final)
year id value
0 2010 100.0 3000.0
1 2011 100.0 0.0
2 2012 100.0 0.0
3 2013 100.0 1000.0
4 2014 100.0 2000.0
5 2015 100.0 1200.0
6 2016 100.0 1300.0
7 2017 100.0 0.0
8 2018 100.0 0.0
9 2010 200.0 2000.0
10 2011 200.0 0.0
11 2012 200.0 1500.0
12 2013 200.0 0.0
13 2014 200.0 0.0
14 2015 200.0 0.0
15 2016 200.0 0.0
16 2017 200.0 0.0
17 2018 200.0 0.0
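In the output above, id and value are converted to float because reindex inserts NaN before the fill. A possible variation, sketched here under the assumption that 0 is an acceptable fill for value, passes fill_value=0 to reindex and then writes the group key back into id, which keeps both columns as integers. The names years, pieces and group_id are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

years = range(2010, 2019)

# reindex each group on its own, so year is unique within the index;
# fill_value=0 fills every column, then assign restores the real id
pieces = [g.set_index('year')
           .reindex(years, fill_value=0)
           .assign(id=group_id)
          for group_id, g in df.groupby('id')]

final = pd.concat(pieces).reset_index()
print(final.dtypes)
```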
Answer 3 (score: 2)
import datetime
import pandas as pd

year = datetime.datetime.today().year  # current year
data = {'id': [100, 100, 100, 100, 100, 200, 200],
        'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
        'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]}
df = pd.DataFrame(data)
min_year = df['year'].min()  # earliest year in the data
difference = year - min_year
years = list(range(year, year - difference, -1))  # current year down to min_year + 1
data['year'] = years
# The lists now differ in length, which would raise 'arrays must all be same length';
# orient='index' avoids that by padding the shorter ones with NaN
df = pd.DataFrame.from_dict(data, orient='index')
df = df.transpose()
print(df)
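To make the padding step easier to see, here is a small sketch with hypothetical data (the dictionary columns and its keys a and b are not from the answer). from_dict with orient='index' treats each list as a row and pads the shorter rows with NaN, and transpose turns the rows back into columns.

```python
import pandas as pd

# hypothetical columns of unequal length
columns = {'a': [1, 2, 3], 'b': [10]}

# each list becomes a row, short rows are padded with NaN, then rows become columns
padded = pd.DataFrame.from_dict(columns, orient='index').transpose()
print(padded)
```

Note that the padding is appended at the end of the shorter column. It does not align values with a particular year or id.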
Answer 4 (score: 2)
Here is one way to generate rows for the "missing" years:
import pandas as pd

data = {'id': [100, 100, 100, 100, 100, 200, 200],
        'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
        'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]}
df = pd.DataFrame(data)

# Pick the existing year extremes
min_year = df.year.min()
max_year = df.year.max()

# Look for years that do not appear anywhere in the data
missing_years = [x for x in range(min_year, max_year + 1) if x not in df.year.values]

# Generate a dataframe
df_missing = pd.DataFrame([[None, x, None] for x in missing_years], columns=df.columns)

# Append it to the original (DataFrame.append was removed in pandas 2.0)
pd.concat([df, df_missing])
Output:
+----+-------+-------+-------+
| | id | year | value |
+----+-------+-------+-------+
| 0 | 100 | 2010 | 3000 |
| 1 | 100 | 2013 | 1000 |
| 2 | 100 | 2014 | 2000 |
| 3 | 100 | 2015 | 1200 |
| 4 | 100 | 2016 | 1300 |
| 5 | 200 | 2010 | 2000 |
| 6 | 200 | 2012 | 1500 |
| 0 | None | 2011 | None |
+----+-------+-------+-------+
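The output above contains only 2011 because the check is made across the whole frame, and 2011 is the only year that no id has. If the missing years are wanted for each id separately, the same idea could be applied per group. This is a sketch of that variation, not the answer's own code. It uses NaN rather than None for the unknown value, and the names all_years, rows and group_id are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

min_year, max_year = df['year'].min(), df['year'].max()
all_years = set(range(min_year, max_year + 1))

# collect one placeholder row for every year an id does not have
rows = []
for group_id, g in df.groupby('id'):
    for y in sorted(all_years - set(g['year'].tolist())):
        rows.append({'id': group_id, 'year': y, 'value': float('nan')})

df_missing = pd.DataFrame(rows, columns=df.columns)

# combine with the original rows and restore a sorted order
result = (pd.concat([df, df_missing], ignore_index=True)
            .sort_values(['id', 'year'])
            .reset_index(drop=True))
print(result)
```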
Answer 5 (score: 0)
Use the pd.merge_ordered method:
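import pandas as pd

df = pd.DataFrame({'id': (100, 100, 100, 100, 100, 200, 200),
                   'year': (2010, 2013, 2014, 2015, 2016, 2010, 2012),
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})
# every year that should appear for each id
df_year = pd.DataFrame({'year': (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)})
# left_by='id' merges each id group with df_year separately
df_merged = pd.merge_ordered(df, df_year, left_by='id').fillna(0)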