I have a pandas DataFrame like this:
date,userId,classification
2018-03-29,55,Large
2018-03-30,55,small
2018-03-29,55,x-small
2018-04-20,65,Large
2018-04-29,75,x-small
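How can I fill in the missing dates, but over a 60-day period for each userId? I tried setting the date as the index in pandas, then reindexing and filling, but that gives nulls for all the other fields. I am open to any solution using Spark DataFrames or pandas, in Python or Java.
The code I tried: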
import pandas as pd
idx = pd.date_range('02-28-2018', '04-29-2018')
df = pd.DataFrame([['Chandler Bing','55','2018-03-29',51],
['Chandler Bing','55','2018-03-29',60],
['Chandler Bing','55','2018-03-30',59],
['Harry Kane','45','2018-04-30',80],
['Harry Kane','45','2018-04-21',90]],columns=['name','accountid','timestamp','size'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
print (df)
df= df.reindex(idx, fill_value=0)
print(df)
The error I get is 'ValueError: cannot reindex from a duplicate axis'.
Even this version does not work:
import pandas as pd
idx = pd.date_range('02-28-2018', '04-29-2018')
df = pd.DataFrame([['Chandler Bing','55','2018-03-29',51],
['Chandler Bing','55','2018-03-29',60],
['Chandler Bing','55','2018-03-30',59],
['Harry Kane','45','2018-04-30',80],
['Harry Kane','45','2018-04-21',90]],columns=['name','accountid','timestamp','size'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
pd.DatetimeIndex(df['timestamp'])
del(df['timestamp'])
#df.set_index('timestamp', inplace=True)
print (df)
df= df.reindex(idx, fill_value=0)
print (df)
uniquaccount=df['accountid'].unique()
print(uniquaccount)
Answer 0 (score: 0)
You can use reindex on a pandas Series:
import pandas as pd
idx = pd.date_range('02-28-2018', '04-29-2018')
# dict keys must be unique, so each date can appear only once here
s = pd.Series({'2018-03-29' : 55,
'2018-03-30' : 55,
'2018-04-20' : 65,
'2018-04-29' : 75})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
This fills in every missing date with 0:
2018-03-28 0
2018-03-29 55
2018-03-30 55
2018-03-31 0
2018-04-01 0
2018-04-02 0
2018-04-03 0
2018-04-04 0
...
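Note that the Series in this answer holds one value per date. The data in the question has repeated dates, and reindexing a Series with a duplicated index raises the same ValueError. One possible workaround is to collapse the duplicates first. The following is a minimal sketch, assuming that summing the values for a repeated date is an acceptable aggregation (that choice is an assumption, not something the question specifies):

```python
import pandas as pd

idx = pd.date_range('2018-02-28', '2018-04-29')

# Two rows share the date 2018-03-29, as in the question's data.
s = pd.Series(
    [51, 60, 59],
    index=pd.to_datetime(['2018-03-29', '2018-03-29', '2018-03-30']),
)

# Collapse duplicate dates so the index is unique.
# Assumption: sum() is the right aggregation; max(), mean(), etc. also work.
daily = s.groupby(level=0).sum()

# With a unique index, reindex can fill the missing dates.
filled = daily.reindex(idx, fill_value=0)
print(filled[filled != 0])
```

This keeps a single value per day, so it only fits cases where the duplicate rows can be merged into one.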
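Answer 1 (score: 0)
reindex does not work on a non-unique index. Instead, create an intermediate DataFrame that contains exactly one row per timestamp/account combination, then merge it with the original: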
import pandas as pd
idx = pd.date_range('02-28-2018', '04-29-2018')
df = pd.DataFrame([['Chandler Bing','55','2018-03-29',51],
['Chandler Bing','55','2018-03-29',60],
['Chandler Bing','55','2018-03-30',59],
['Harry Kane','45','2018-04-30',80],
['Harry Kane','45','2018-04-21',90]],columns=['name','accountid','timestamp','size'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Step 1: create an intermediate dataframe with the cartesian product (CROSS JOIN)
# of all of the timestamps and IDs
idx = pd.Series(idx, name='timestamp').to_frame()
unique_accounts = df[['accountid', 'name']].drop_duplicates()
# Pandas CROSS JOIN, see https://stackoverflow.com/questions/53699012/performant-cartesian-product-cross-join-with-pandas/53699013#53699013
df_intermediate = pd.merge(unique_accounts.assign(dummy=1), idx.assign(dummy=1), on='dummy', how='inner')
df_intermediate = df_intermediate.drop(columns='dummy')
# Step 2: merge with the original dataframe, and fill missing values
df_new = df_intermediate.merge(df.drop(columns='name'), how='left', on=['accountid', 'timestamp'])
df_new['size'] = df_new['size'].fillna(value=0)
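The merge above uses one shared date range for every account. If the 60-day period in the question is meant per account, the calendar can be built account by account instead. This is only a sketch under a stated assumption: each account's window is taken to be the 60 days ending at that account's latest timestamp. WINDOW_DAYS and that anchoring rule are illustrative choices, not something the question defines.

```python
import pandas as pd

df = pd.DataFrame(
    [['Chandler Bing', '55', '2018-03-29', 51],
     ['Chandler Bing', '55', '2018-03-29', 60],
     ['Chandler Bing', '55', '2018-03-30', 59],
     ['Harry Kane', '45', '2018-04-30', 80],
     ['Harry Kane', '45', '2018-04-21', 90]],
    columns=['name', 'accountid', 'timestamp', 'size'])
df['timestamp'] = pd.to_datetime(df['timestamp'])

WINDOW_DAYS = 60  # assumption: length of each account's window

# Build one calendar per account: WINDOW_DAYS consecutive days
# ending at that account's most recent timestamp (assumed anchoring).
frames = []
for (accountid, name), group in df.groupby(['accountid', 'name']):
    days = pd.date_range(end=group['timestamp'].max(),
                         periods=WINDOW_DAYS, freq='D')
    frames.append(pd.DataFrame({'accountid': accountid,
                                'name': name,
                                'timestamp': days}))
calendar = pd.concat(frames, ignore_index=True)

# Left-merge the real rows onto the calendar and fill the gaps.
filled = calendar.merge(df.drop(columns='name'), how='left',
                        on=['accountid', 'timestamp'])
filled['size'] = filled['size'].fillna(0)
print(filled.groupby('accountid')['timestamp'].agg(['min', 'max', 'nunique']))
```

Dates that occur more than once for an account (2018-03-29 for account 55) stay as separate rows after the merge, so the result can have more rows than days.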
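Also, consider using a column name other than 'size'. It is not strictly a reserved word, but size is already an attribute of every pandas DataFrame and Series (the number of elements), so df.size returns that count rather than the column, and the column is only reachable as df['size'].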
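To illustrate the point about 'size', here is a small sketch. The two-row frame and the replacement name item_size are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'accountid': ['55', '45'], 'size': [51, 80]})

# Attribute access hits the built-in DataFrame.size property:
# the total number of cells (rows * columns), not the column.
n_cells = df.size

# Bracket access always returns the column itself.
size_column = df['size']

# Renaming the column (item_size is a hypothetical name)
# makes attribute access unambiguous again.
renamed = df.rename(columns={'size': 'item_size'})
print(n_cells, list(size_column), list(renamed.item_size))
```

Bracket access works for any column name, so renaming is a convenience rather than a requirement.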