I'm trying to reshape this sample DataFrame from long to wide format without aggregating any of the data.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'SubjectID': ['A', 'A', 'A', 'B', 'B', 'C', 'A'],
    'Date': ['2010-03-14', '2010-03-15', '2010-03-16', '2010-03-14',
             '2010-05-15', '2010-03-14', '2010-03-14'],
    'Var1': [1, 12, 4, 7, 90, 1, 9],
    'Var2': [0, 0, 1, 1, 1, 0, 1],
    'Var3': [np.nan, 1, 0, np.nan, 0, 1, np.nan],
})
df['Date'] = pd.to_datetime(df['Date'])
df
Date SubjectID Var1 Var2 Var3
0 2010-03-14 A 1 0 NaN
1 2010-03-15 A 12 0 1.0
2 2010-03-16 A 4 1 0.0
3 2010-03-14 B 7 1 NaN
4 2010-05-15 B 90 1 0.0
5 2010-03-14 C 1 0 1.0
6 2010-03-14 A 9 1 NaN
To deal with the duplicate values, I group by the "Date" column and take the cumulative count within each group. Then I build a pivot table:
df['idx'] = df.groupby('Date').cumcount()
dfp = df.pivot_table(index = 'SubjectID', columns = 'idx'); dfp
Var1 Var2 Var3
idx 0 1 2 3 0 1 2 3 0 2
SubjectID
A 5.666667 NaN NaN 9.0 0.333333 NaN NaN 1.0 0.5 NaN
B 90.000000 7.0 NaN NaN 1.000000 1.0 NaN NaN 0.0 NaN
C NaN NaN 1.0 NaN NaN NaN 0.0 NaN NaN 1.0
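However, I want the column index to hold the values from the "Date" column instead of idx, and I don't want to aggregate any data. The expected output is: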
Var1_2010-03-14 Var1_2010-03-14 Var1_2010-03-15 Var1_2010-03-16 Var1_2010-05-15 Var2_2010-03-14 Var2_2010-03-15 Var2_2010-03-16 Var2_2010-05-15 Var3_2010-03-14 Var3_2010-03-15 Var3_2010-03-16 Var3_2010-05-15
SubjectID
A 1 9 12 4 NaN 0 1 0 1.0 NaN NaN NaN 1.0 0.0 NaN
B 7.0 NaN NaN NaN 90 1 NaN NaN 1.0 NaN NaN NaN NaN NaN 0.0
C 1 NaN NaN NaN NaN 0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
How can I do this? Eventually I'll merge the two column index levels with dfp.columns = [col[0] + '_' + str(col[1]) for col in dfp.columns].
Answer 0 (score: 1)
You're on the right track:
# group
df['idx'] = df.groupby('Date').cumcount()
# set index and unstack
new = df.set_index(['idx','Date', 'SubjectID']).unstack(level=[0,1])
# drop the idx level from the column MultiIndex
new.columns = new.columns.droplevel(1)
new.columns = [f'{val}_{date}' for val, date in new.columns]
I think this gives your expected output.
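The snippet above numbers duplicates within each Date, so the wide frame gets one column per (idx, Date) pair, which is 7 columns per variable for the sample data. If the layout from the question (5 columns per variable, with 2010-03-14 appearing twice) is what's wanted, one possible variant is to number duplicates within each (SubjectID, Date) pair instead. This is a sketch rather than part of the original answer: the names `dup` and `wide` are placeholders introduced here, and Date is left as ISO strings so that the labels read 'Var1_2010-03-14' and still sort chronologically.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SubjectID': ['A', 'A', 'A', 'B', 'B', 'C', 'A'],
    'Date': ['2010-03-14', '2010-03-15', '2010-03-16', '2010-03-14',
             '2010-05-15', '2010-03-14', '2010-03-14'],
    'Var1': [1, 12, 4, 7, 90, 1, 9],
    'Var2': [0, 0, 1, 1, 1, 0, 1],
    'Var3': [np.nan, 1, 0, np.nan, 0, 1, np.nan],
})

# Assumption: a "duplicate" is a repeated (SubjectID, Date) pair, so the
# counter restarts for every subject/date combination.
df['dup'] = df.groupby(['SubjectID', 'Date']).cumcount()

# Move Date and dup into the columns; nothing is aggregated because
# (SubjectID, Date, dup) is unique for every row.
wide = (
    df.set_index(['SubjectID', 'Date', 'dup'])
      .unstack(level=['Date', 'dup'])
      .sort_index(axis=1)
)

# Flatten (variable, date, dup) to 'variable_date'. Dropping dup means a
# repeated date yields repeated column labels, as in the expected output.
wide.columns = [f'{var}_{date}' for var, date, _ in wide.columns]
print(wide)
```

Because the flattened labels are not unique, selecting `wide['Var1_2010-03-14']` returns a two-column frame; keeping `dup` in the label avoids that if unique names are preferred.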
Using map looks a bit faster:
df['idx'] = df.groupby('Date').cumcount()
df['Date'] = df['Date'].astype(str)
new = df.set_index(['idx','Date', 'SubjectID']).unstack(level=[0,1])
new.columns = new.columns.droplevel(1)
#new.columns = [f'{val}_{date}' for val, date in new.columns]
new.columns = new.columns.map('_'.join)
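The cast of Date to strings matters in this version: `'_'.join` only accepts strings, so every level of the column MultiIndex has to hold strings before `map` is applied. A minimal illustration on hand-built indexes (the labels and the names `str_cols`, `ts_cols`, `flat` and `join_failed` are made up for this example):

```python
import pandas as pd

# Hypothetical two-level column index with strings on both levels.
str_cols = pd.MultiIndex.from_tuples(
    [('Var1', '2010-03-14'), ('Var1', '2010-03-15'), ('Var2', '2010-03-14')]
)
# map calls '_'.join on each (level0, level1) tuple.
flat = str_cols.map('_'.join)
print(list(flat))

# Same labels, but the second level holds Timestamps instead of strings.
ts_cols = pd.MultiIndex.from_arrays([
    ['Var1', 'Var1', 'Var2'],
    pd.to_datetime(['2010-03-14', '2010-03-15', '2010-03-14']),
])
try:
    ts_cols.map('_'.join)
    join_failed = False
except TypeError:
    # str.join rejects non-string items such as Timestamp.
    join_failed = True
print('join on Timestamps failed:', join_failed)
```

The list comprehension with an f-string does not have this restriction, since f-strings format any object, but the resulting labels then include the time part of each Timestamp.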
Here is a test example with 50,000 rows:
#data
data = pd.DataFrame(pd.date_range('2000-01-01', periods=50000, freq='D'))
data['a'] = list('abcd')*12500
data['b'] = 2
data['c'] = list('ABCD')*12500
data.rename(columns={0:'date'}, inplace=True)
# list comprehension:
%%timeit -r 3 -n 200
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = [f'{x}_{y}' for x,y in new.columns]
# 98.2 ms ± 13.3 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)
# map with join:
%%timeit -r 3 -n 200
data['date'] = data['date'].astype(str)
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = new.columns.map('_'.join)
# 84.6 ms ± 3.87 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)
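Outside IPython, the `%%timeit` cells can be approximated with the standard-library `timeit` module. The sketch below is one way to set that up, under a few assumptions that differ from the cells above: `make_wide`, `flatten_comprehension` and `flatten_map` are hypothetical helper names, the row count is reduced so it runs quickly, and the date column is converted to strings once up front (with `dt.strftime`, swapped in for `astype(str)`) so that both variants time only the column flattening rather than the whole reshape. Timings depend on the machine and pandas version, so none are quoted.

```python
import timeit

import pandas as pd


def make_wide(n_rows=4000):
    """Build the wide frame once; n_rows is assumed to be a multiple of 4."""
    data = pd.DataFrame(
        {'date': pd.date_range('2000-01-01', periods=n_rows, freq='D')}
    )
    # Convert dates to strings up front so '_'.join works on every level.
    data['date'] = data['date'].dt.strftime('%Y-%m-%d')
    data['a'] = list('abcd') * (n_rows // 4)
    data['b'] = 2
    data['c'] = list('ABCD') * (n_rows // 4)
    wide = data.set_index(['a', 'date', 'c']).unstack(level=[0, 1])
    # Drop the top level (the value column name 'b'), leaving (a, date).
    wide.columns = wide.columns.droplevel(0)
    return wide


def flatten_comprehension(columns):
    return [f'{x}_{y}' for x, y in columns]


def flatten_map(columns):
    return list(columns.map('_'.join))


wide = make_wide()
loops = 20
t_comp = timeit.timeit(lambda: flatten_comprehension(wide.columns), number=loops)
t_map = timeit.timeit(lambda: flatten_map(wide.columns), number=loops)
print(f'list comprehension: {t_comp / loops * 1e3:.2f} ms per loop')
print(f'map with join:      {t_map / loops * 1e3:.2f} ms per loop')
```

Separating the setup from the timed call also avoids a quirk of the map cell above, where the string conversion of the date column sits inside the timed code.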