我正在使用时间序列Pandas数据框,其中每个公司都有一些按日期索引的观察值。对于每个观察,都有多个值。我一直在尝试使用熊猫MultiIndex与堆栈/非堆栈进行比较,但已经接近预期的结果,但是卡住了:
这就是我尝试过的-完整的代码可以重现:
import pandas.util.testing as tm; tm.N = 3
import numpy as np
def unpivot(frame):
N, K = frame.shape
data = {'value' : frame.values.ravel('F'),
'variable' : np.asarray(frame.columns).repeat(N),
'date' : np.tile(np.asarray(frame.index), K)}
return pd.DataFrame(data, columns=['date', 'variable', 'value'])
df = unpivot(tm.makeTimeDataFrame())
df['topic'] = df.groupby('variable').cumcount()+1
df['company'] = np.random.choice(['Apple Inc', 'Google', 'Nike Corp'], len(df))
这就是我的开始:
date variable value topic company
0 2000-01-03 A -1.250926 1 Apple Inc
1 2000-01-04 A -0.807887 2 Google
2 2000-01-05 A 0.302593 3 Nike Corp
3 2000-01-03 B 0.502474 1 Nike Corp
4 2000-01-04 B 0.974968 2 Apple Inc
5 2000-01-05 B 0.698478 3 Google
6 2000-01-03 C 0.871437 1 Nike Corp
7 2000-01-04 C 0.235059 2 Apple Inc
8 2000-01-05 C -1.827321 3 Google
9 2000-01-03 D -0.764457 1 Google
10 2000-01-04 D 0.014109 2 Nike Corp
11 2000-01-05 D -0.966902 3 Google
我的目标是这样的:
topic 1 2 3
company variable date
Apple Inc A 2000-01-03 -1.2 NaN NaN
B 2000-01-04 NaN 0.9 NaN
C 2000-01-04 NaN 0.2 NaN
Google A 2000-01-04 NaN -0.8 NaN
B 2000-01-05 NaN NaN 0.69
我已经尝试过了,但是'date'不正确:
df.set_index(['company', 'variable', 'date'], inplace=False).unstack()
value topic \
date 2000-01-03 2000-01-04 2000-01-05 2000-01-03 2000-01-04
company variable
Apple Inc A -1.250926 NaN NaN 1.0 NaN
B NaN 0.974968 NaN NaN 2.0
C NaN 0.235059 NaN NaN 2.0
Google A NaN -0.807887 NaN NaN 2.0
B NaN NaN 0.698478 NaN NaN
C NaN NaN -1.827321 NaN NaN
D -0.764457 NaN -0.966902 1.0 NaN
Nike Corp A NaN NaN 0.302593 NaN NaN
B 0.502474 NaN NaN 1.0 NaN
C 0.871437 NaN NaN 1.0 NaN
D NaN 0.014109 NaN NaN 2.0
date 2000-01-05
company variable
Apple Inc A NaN
B NaN
C NaN
Google A NaN
B 3.0
C 3.0
D 3.0
Nike Corp A 3.0
B NaN
C NaN
D NaN
答案 0 :(得分:2)
使用get_dummies
s=df.set_index(['company','variable','date']).topic.astype(str).str.get_dummies()
s.mul(s.columns).sort_index(level=0).sum(level=[0,1,2])
Out[335]:
1 2 3
company variable date
AppleInc A 2000-01-03 1
B 2000-01-05 3
D 2000-01-03 1
2000-01-05 3
Google A 2000-01-04 2
2000-01-05 3
B 2000-01-03 1
C 2000-01-03 1
2000-01-04 2
D 2000-01-04 2
NikeCorp B 2000-01-04 2
C 2000-01-05 3
自问题更新以来更新
df.pivot_table(index=['company','variable','date'],columns='topic',values='value')