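I have a pandas DataFrame in which two columns together form a unique key (eng_id, date). I need to convert it to the shape below, creating one column per unique equipment_id value that holds its measurement. How can I do this?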
From:
eng_id date equipment_id measurement
1 2016-01 100 20
1 2016-01 200 46
1 2016-01 300 18
1 2016-04 200 33
1 2016-05 200 27
2 2016-01 300 9
2 2016-01 400 15
2 2016-05 400 65
2 2016-05 500 51
2 2016-05 600 16
To:
ID 100 200 300 400 500 600
1,2016-01 20 46 18 0 0 0
1,2016-04 0 33 0 0 0 0
1,2016-05 0 27 0 0 0 0
2,2016-01 0 0 9 15 0 0
2,2016-05 0 0 0 65 51 16
Answer 0 (score: 2)
Join the two key columns into a single ID column and use pivot:
df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
df = df.pivot(index='ID', columns='equipment_id', values='measurement').fillna(0).astype(int)
print (df)
equipment_id 100 200 300 400 500 600
ID
1,2016-01 20 46 18 0 0 0
1,2016-04 0 33 0 0 0 0
1,2016-05 0 27 0 0 0 0
2,2016-01 0 0 9 15 0 0
2,2016-05 0 0 0 65 51 16
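For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same pivot. The variable name wide is my own choice, and the data is typed in by hand from the table in the question.

```python
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

# Build the combined key, then spread equipment_id values into columns
df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
wide = (df.pivot(index='ID', columns='equipment_id', values='measurement')
          .fillna(0)       # missing (ID, equipment_id) pairs become 0
          .astype(int))    # pivot introduces NaN, so cast back to integers
print(wide)
```

The fillna(0).astype(int) step is needed because pivot leaves NaN for combinations that do not exist, which turns the columns into floats.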
Alternatively, use set_index + unstack:
df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
df = df.set_index(['ID', 'equipment_id'])['measurement'].unstack(fill_value=0)
print (df)
equipment_id 100 200 300 400 500 600
ID
1,2016-01 20 46 18 0 0 0
1,2016-04 0 33 0 0 0 0
1,2016-05 0 27 0 0 0 0
2,2016-01 0 0 9 15 0 0
2,2016-05 0 0 0 65 51 16
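As a rough check that the two approaches agree, the sketch below runs both on the sample data and compares the values. The names via_pivot and via_unstack are only illustrative. As far as I know, unstack(fill_value=0) fills the gaps directly, so the integer dtype is kept without a separate cast.

```python
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})
df['ID'] = df['eng_id'].astype(str) + ',' + df['date']

# Approach 1: pivot, then replace NaN and cast back to integers
via_pivot = (df.pivot(index='ID', columns='equipment_id', values='measurement')
               .fillna(0)
               .astype(int))

# Approach 2: set_index + unstack, filling gaps while reshaping
via_unstack = (df.set_index(['ID', 'equipment_id'])['measurement']
                 .unstack(fill_value=0))

# Same labels and same numbers in both results
print((via_pivot.values == via_unstack.values).all())
print(via_unstack.dtypes)
```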
But if you need the ID kept as 2 separate levels (eng_id and date) instead of one joined string:
df = df.set_index(['eng_id', 'date', 'equipment_id'])['measurement'].unstack(fill_value=0)
print (df)
equipment_id 100 200 300 400 500 600
eng_id date
1 2016-01 20 46 18 0 0 0
2016-04 0 33 0 0 0 0
2016-05 0 27 0 0 0 0
2 2016-01 0 0 9 15 0 0
2016-05 0 0 0 65 51 16
To get them as regular columns, add reset_index + rename_axis:
# Parentheses are required so the chained calls can span several lines
df = (df.set_index(['eng_id', 'date', 'equipment_id'])['measurement']
        .unstack(fill_value=0)
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
eng_id date 100 200 300 400 500 600
0 1 2016-01 20 46 18 0 0 0
1 1 2016-04 0 33 0 0 0 0
2 1 2016-05 0 27 0 0 0 0
3 2 2016-01 0 0 9 15 0 0
4 2 2016-05 0 0 0 65 51 16
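One detail worth knowing about the flat result: the equipment columns keep their original integer labels, so they sit next to the string labels eng_id and date. The sketch below shows this on the sample data and, as an optional step that is my own addition rather than part of the answer, converts every label to a string. The name flat is illustrative.

```python
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

flat = (df.set_index(['eng_id', 'date', 'equipment_id'])['measurement']
          .unstack(fill_value=0)
          .reset_index()                # eng_id and date become ordinary columns
          .rename_axis(None, axis=1))   # drop the 'equipment_id' axis label

# Labels are a mix of strings and integers at this point
print(flat.columns.tolist())

# Optional: make every column label a string
flat.columns = [str(c) for c in flat.columns]
print(flat)
```

After the conversion a column is selected with flat['200'] rather than flat[200].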
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
it means you have duplicates and need pivot_table with an aggregate function such as mean, sum, ...:
print (df)
eng_id date equipment_id measurement
0 1 2016-01 100 20 <-duplicate 1 2016-01 100
1 1 2016-01 100 30 <-duplicate 1 2016-01 100
2 1 2016-01 200 46
3 1 2016-01 300 18
4 1 2016-04 200 33
5 1 2016-05 200 27
6 2 2016-01 300 9
7 2 2016-01 400 15
8 2 2016-05 400 65
9 2 2016-05 500 51
10 2 2016-05 600 16
df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
df = df.pivot_table(index='ID',
columns='equipment_id',
values='measurement',
fill_value=0,
aggfunc='mean')
print (df)
equipment_id 100 200 300 400 500 600
ID
1,2016-01 25 46 18 0 0 0 <= (20+30)/2=25
1,2016-04 0 33 0 0 0 0
1,2016-05 0 27 0 0 0 0
2,2016-01 0 0 9 15 0 0
2,2016-05 0 0 0 65 51 16
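A small sketch of the failure and the fix together, using the sample data with one duplicated (eng_id, date, equipment_id) combination. The names error_message and wide are illustrative. Choosing mean here is only one option, and whether mean, sum, first or something else is right depends on what the duplicates represent.

```python
import pandas as pd

# Sample data with a duplicated key: (1, '2016-01', 100) appears twice
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 30, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})
df['ID'] = df['eng_id'].astype(str) + ',' + df['date']

# pivot cannot decide which of the two values to keep, so it raises
error_message = None
try:
    df.pivot(index='ID', columns='equipment_id', values='measurement')
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# pivot_table resolves duplicates with an aggregate function
wide = df.pivot_table(index='ID',
                      columns='equipment_id',
                      values='measurement',
                      fill_value=0,
                      aggfunc='mean')
print(wide)
```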
Or use groupby + aggregate function + unstack:
df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
df = df.groupby(['ID', 'equipment_id'])['measurement'].mean().unstack(fill_value=0)
print (df)
equipment_id 100 200 300 400 500 600
ID
1,2016-01 25 46 18 0 0 0 <= (20+30)/2=25
1,2016-04 0 33 0 0 0 0
1,2016-05 0 27 0 0 0 0
2,2016-01 0 0 9 15 0 0
2,2016-05 0 0 0 65 51 16
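Before picking an aggregate function it can help to look at which rows actually collide. The sketch below is one way to do that with DataFrame.duplicated, followed by the groupby version. The names key, dupes and wide are my own.

```python
import pandas as pd

# Sample data with a duplicated key: (1, '2016-01', 100) appears twice
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 30, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

# keep=False marks every member of a duplicated group, not just the repeats
key = ['eng_id', 'date', 'equipment_id']
dupes = df[df.duplicated(key, keep=False)]
print(dupes)

# Aggregate the duplicates, then reshape
df['ID'] = df['eng_id'].astype(str) + ',' + df['date']
wide = (df.groupby(['ID', 'equipment_id'])['measurement']
          .mean()
          .unstack(fill_value=0))
print(wide)
```

If dupes comes back empty, plain pivot or set_index + unstack is enough.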
Answer 1 (score: 0)
This assumes the combination of ['eng_id', 'date', 'equipment_id'] is unique.
z = list(zip(df.eng_id.values.tolist(), df.date.values.tolist()))
# i will be the positions I will use to insert into the values array
# u will be the tuples that make up the index
i, u = pd.Series(z).factorize()
idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date'])
# j will be the column positions I will use to insert into the values array
# col will be the column labels
j, col = df.equipment_id.factorize()
# Create a place holder dataframe
d = pd.DataFrame(0, idx, col)
# fill the values
d.values[i, j] = df.measurement.values
print(d)
100 200 300 400 500 600
eng_id date
1 2016-01 20 46 18 0 0 0
2016-04 0 33 0 0 0 0
2016-05 0 27 0 0 0 0
2 2016-01 0 0 9 15 0 0
2016-05 0 0 0 65 51 16
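Assigning through d.values relies on that attribute exposing the frame's underlying data rather than a copy, which I believe is not guaranteed in every pandas version or configuration (copy-on-write mode, for example). A variant of the same idea that avoids the question is to fill a plain NumPy array first and wrap it in a DataFrame afterwards. This is a sketch on the sample data, and the name values is my own.

```python
import numpy as np
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    'eng_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date': ['2016-01', '2016-01', '2016-01', '2016-04', '2016-05',
             '2016-01', '2016-01', '2016-05', '2016-05', '2016-05'],
    'equipment_id': [100, 200, 300, 200, 200, 300, 400, 400, 500, 600],
    'measurement': [20, 46, 18, 33, 27, 9, 15, 65, 51, 16],
})

# i: row position of each record, u: the unique (eng_id, date) pairs
z = list(zip(df['eng_id'], df['date']))
i, u = pd.Series(z).factorize()
idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date'])

# j: column position of each record, col: the unique equipment ids
j, col = df['equipment_id'].factorize()

# Fill a NumPy array we own, then build the frame from it
values = np.zeros((len(u), len(col)), dtype=df['measurement'].dtype)
values[i, j] = df['measurement'].to_numpy()
d = pd.DataFrame(values, index=idx, columns=col)
print(d)
```

Rows and columns come out in order of first appearance, because factorize does not sort by default.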
Timing
Small data
For large data this may look different; I have not tested it.
%%timeit
z = list(zip(df.eng_id.values.tolist(), df.date.values.tolist()))
i, u = pd.Series(z).factorize()
idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date'])
j, col = df.equipment_id.factorize()
d = pd.DataFrame(0, idx, col)
d.values[i, j] = df.measurement.values
1000 loops, best of 3: 885 µs per loop
%timeit df.set_index(['eng_id', 'date', 'equipment_id'])['measurement'].unstack(fill_value=0)
100 loops, best of 3: 1.96 ms per loop
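Outside IPython, a comparison on larger data could be set up roughly as follows. Everything about the synthetic data is an assumption of mine (200 eng_id values, 12 months, 20 equipment ids, half of the combinations kept), and the factorize variant fills a NumPy array rather than writing through d.values. I am not quoting any timings, since they will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: every key combination is unique by construction
rng = np.random.default_rng(0)
full = pd.MultiIndex.from_product(
    [range(200),
     [f'2016-{m:02d}' for m in range(1, 13)],
     range(100, 2100, 100)],
    names=['eng_id', 'date', 'equipment_id']).to_frame(index=False)
big = full.sample(frac=0.5, random_state=0).reset_index(drop=True)
big['measurement'] = rng.integers(1, 100, size=len(big))


def with_unstack(df):
    return (df.set_index(['eng_id', 'date', 'equipment_id'])['measurement']
              .unstack(fill_value=0))


def with_factorize(df):
    z = list(zip(df['eng_id'], df['date']))
    i, u = pd.Series(z).factorize()
    idx = pd.MultiIndex.from_tuples(u, names=['eng_id', 'date'])
    j, col = df['equipment_id'].factorize()
    values = np.zeros((len(u), len(col)), dtype=df['measurement'].dtype)
    values[i, j] = df['measurement'].to_numpy()
    return pd.DataFrame(values, index=idx, columns=col)


# Average wall-clock time over a few calls of each function
for fn in (with_unstack, with_factorize):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f'{fn.__name__}: {seconds / 5 * 1000:.2f} ms per call')
```

Note that the two functions order rows and columns differently (sorted versus first appearance), so sort both axes before comparing their results.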