I have a relatively large df (10^6 records) structured as follows:
Date,SN,Zip Code,A,B,Total,Lat,Lon
2015-09-01,10948.0,80015,0,0,1,39.626999999999995,-104.779
2015-09-01,11906.0,85392,0,0,1,33.478,-112.309
2015-09-03,10948.0,85260,0,0,1,33.611,-111.891
2015-09-03,11906.0,85050,0,0,1,33.683,-111.99799999999999
2015-09-05,12111.0,23834,0,0,1,37.291,-77.404
2015-09-05,11906.0,72761,0,0,1,36.169000000000004,-94.455
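Note that each SN (a unique identifier) has at most one record per day. On some days, some SNs have no record, which means Total is 0 for that day. I want to convert this df into a numpy array that shows Total for each day (rows) and each SN (columns), with the missing dates filled in with 0.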
Answer 0 (score: 1)
You need pivot:
df.pivot(index='Date', columns='SN', values='Total').fillna(0)
#SN 10948.0 11906.0 12111.0
#Date
#2015-09-01 1.0 1.0 0.0
#2015-09-03 1.0 1.0 0.0
#2015-09-05 0.0 1.0 1.0
To get the numpy array:
df.pivot(index='Date', columns='SN', values='Total').fillna(0).values
#array([[ 1., 1., 0.],
# [ 1., 1., 0.],
# [ 0., 1., 1.]])
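If integer counts are preferred over floats, one possible variation is to build the wide table with set_index plus unstack(fill_value=0) instead of pivot followed by fillna. This is only a minimal sketch: it assumes each (Date, SN) pair is unique, as the question states, and the sample frame and the names wide and arr are illustrative.

```python
import pandas as pd

# Small sample mirroring the question's data (only the columns needed here).
df = pd.DataFrame({
    "Date": ["2015-09-01", "2015-09-01", "2015-09-03",
             "2015-09-03", "2015-09-05", "2015-09-05"],
    "SN": [10948.0, 11906.0, 10948.0, 11906.0, 12111.0, 11906.0],
    "Total": [1, 1, 1, 1, 1, 1],
})

# unstack moves SN from the row index into the columns; fill_value=0 fills
# the missing (Date, SN) combinations directly instead of leaving NaN.
wide = df.set_index(["Date", "SN"])["Total"].unstack(fill_value=0)

# Rows are dates, columns are SNs (both sorted).
arr = wide.to_numpy()
```

Because no NaN is introduced along the way, the integer dtype of Total should normally be kept, whereas pivot followed by fillna(0) yields floats.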
Update: to get all dates, you can use reindex:
# convert Date column to datetime
df['Date'] = pd.to_datetime(df.Date)
# pivot to wide format
df1 = df.pivot(index='Date', columns='SN', values='Total').fillna(0)
# reindex to get all dates
df1.reindex(pd.date_range(df1.index.min(), df1.index.max())).fillna(0)
# SN 10948.0 11906.0 12111.0
#2015-09-01 1.0 1.0 0.0
#2015-09-02 0.0 0.0 0.0
#2015-09-03 1.0 1.0 0.0
#2015-09-04 0.0 0.0 0.0
#2015-09-05 0.0 1.0 1.0
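For reference, the whole flow from the sample records to the final array can be put together as one self-contained sketch. The small frame below stands in for the real 10^6-row df, the variable names (wide, all_days, full, arr) are illustrative, and a daily frequency between the first and last observed date is assumed.

```python
import pandas as pd

# Sample records taken from the question (only the columns needed here).
df = pd.DataFrame({
    "Date": ["2015-09-01", "2015-09-01", "2015-09-03",
             "2015-09-03", "2015-09-05", "2015-09-05"],
    "SN": [10948.0, 11906.0, 10948.0, 11906.0, 12111.0, 11906.0],
    "Total": [1, 1, 1, 1, 1, 1],
})

# Parse dates so the index can be compared against a date range.
df["Date"] = pd.to_datetime(df["Date"])

# One row per observed date, one column per SN; missing pairs become NaN.
wide = df.pivot(index="Date", columns="SN", values="Total")

# Every calendar day between the first and last observed date.
all_days = pd.date_range(wide.index.min(), wide.index.max(), freq="D")

# Add the missing days as rows, then replace every NaN with 0.
full = wide.reindex(all_days).fillna(0)

# Final array: rows are days, columns are SNs.
arr = full.to_numpy()
```

Here arr has one row per calendar day, including the two days with no records at all, and one column per SN.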