I have the following DataFrame:
   Timestamp id        lat       long
0     665047  a  30.508420 -84.372882
1     665047  b  30.491882 -84.372938
2    2058714  b  30.492026 -84.372938
3     665348  a  30.508420 -84.372882
4    2055292  b  30.491899 -84.372938
The result I want is:
   Timestamp                        a                        b
0     665047  [30.508420, -84.372882]  [30.491882, -84.372938]
1     665348  [30.508420, -84.372882]                      NaN
2    2055292                      NaN  [30.491899, -84.372938]
3    2058714                      NaN  [30.492026, -84.372938]
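The unique values found in df.id become the column headers (there may be thousands of them), with the corresponding latitude and longitude as the values.
The closest I have come is: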
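from collections import defaultdict
# dct was not defined in the original snippet; a defaultdict(list) is assumed here
dct = defaultdict(list)
for i, r in df.iterrows():
    dct[r.Timestamp].append([r.id, r.lat, r.long])
pd.DataFrame.from_dict(dct, orient='index')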
                                         0                           1
2055292         [b, 30.491899, -84.372938]                        None
2058714         [b, 30.492026, -84.372938]                        None
665348   [a, 30.50842, -84.37288199999999]                        None
665047   [a, 30.50842, -84.37288199999999]  [b, 30.491882, -84.372938]
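But I know that iterating of any kind in pandas is bad practice (and this is nowhere near my desired result anyway), and I'm sure there is a simpler way.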
Answer 0 (score: 3)
I think you can do this with unstack:
(df.groupby(['Timestamp', 'id'])
   .apply(lambda x: x[['lat', 'long']].values.flatten())
   .unstack(level='id'))
id                              a                        b
Timestamp
665047     [30.50842, -84.372882]  [30.491882, -84.372938]
665348     [30.50842, -84.372882]                     None
2055292                      None  [30.491899, -84.372938]
2058714                      None  [30.492026, -84.372938]
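For reference, here is a self-contained sketch of the same groupby-then-unstack idea that runs against the sample data from the question. It selects the lat and long columns before calling apply and returns plain lists rather than arrays; both are my own tweaks rather than part of the answer above, and the names coords and wide are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Timestamp": [665047, 665047, 2058714, 665348, 2055292],
    "id": ["a", "b", "b", "a", "b"],
    "lat": [30.508420, 30.491882, 30.492026, 30.508420, 30.491899],
    "long": [-84.372882, -84.372938, -84.372938, -84.372882, -84.372938],
})

# One [lat, long] list per (Timestamp, id) group. Selecting the two
# columns first keeps the grouping keys out of the applied frame.
coords = (
    df.groupby(["Timestamp", "id"])[["lat", "long"]]
      .apply(lambda g: g.to_numpy().ravel().tolist())
)

# Move the id level into the columns; pairs that never occur are left missing.
wide = coords.unstack(level="id")
print(wide)
```

If a (Timestamp, id) pair occurs more than once, this version concatenates all of that pair's coordinates into one longer list instead of failing, which may or may not be what you want.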
Answer 1 (score: 2)
set_index, then pipe:
df.set_index(['Timestamp', 'id']).pipe(
    lambda d: pd.Series(d.values.tolist(), d.index).unstack()
)
id                                      a                        b
Timestamp
665047     [30.50842, -84.37288199999999]  [30.491882, -84.372938]
665348     [30.50842, -84.37288199999999]                     None
2055292                              None  [30.491899, -84.372938]
2058714                              None  [30.492026, -84.372938]
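One thing to watch with the set_index route: unstack raises a ValueError when the same (Timestamp, id) pair appears more than once. Below is a rough sketch that wraps the approach in a helper and checks for that up front. The helper name to_wide and its error message are my own invention, not something pandas provides.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Timestamp": [665047, 665047, 2058714, 665348, 2055292],
    "id": ["a", "b", "b", "a", "b"],
    "lat": [30.508420, 30.491882, 30.492026, 30.508420, 30.491899],
    "long": [-84.372882, -84.372938, -84.372938, -84.372882, -84.372938],
})


def to_wide(frame):
    """Hypothetical helper: one [lat, long] list per (Timestamp, id) cell."""
    indexed = frame.set_index(["Timestamp", "id"])
    # unstack cannot reshape an index with repeated entries,
    # so fail early with a more specific message.
    if indexed.index.has_duplicates:
        raise ValueError("duplicate (Timestamp, id) pairs; aggregate them first")
    pairs = pd.Series(
        indexed[["lat", "long"]].values.tolist(), index=indexed.index
    )
    return pairs.unstack("id")


wide = to_wide(df)
print(wide)
```

With the sample data every pair is unique, so the check passes and the result has one row per Timestamp and one column per id.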
Or, with a dict comprehension:
cols = ['Timestamp', 'id', 'lat', 'long']
pd.Series({
    t[:2]: list(t[2:])
    for t in df[cols].itertuples(index=False)
}).unstack()
                                      a                        b
665047   [30.50842, -84.37288199999999]  [30.491882, -84.372938]
665348   [30.50842, -84.37288199999999]                     None
2055292                            None  [30.491899, -84.372938]
2058714                            None  [30.492026, -84.372938]