I have this DataFrame:
ID Date X choice
A 07/16/2019 . 123
A 07/17/2019 . 789
A 07/18/2019 . 0
A 07/19/2019 . 456
B 07/16/2019 . 0
B 07/16/2019 . 789
B 07/17/2019 . 0
B 07/18/2019 . 123
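I want to create dummy variables that indicate, for each alternative in a set (123, 456, 789), whether it was the last alternative chosen before the current row within the same ID. A choice of 0 means no alternative was chosen.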
Expected result:
ID Date X choice 123_L 456_L 789_L
A 07/16/2019 . 123 0 0 0
A 07/17/2019 . 789 1 0 0
A 07/18/2019 . 0 0 0 1
A 07/19/2019 . 456 0 0 1
B 07/16/2019 . 0 0 0 0
B 07/16/2019 . 789 0 0 0
B 07/17/2019 . 0 0 0 1
B 07/18/2019 . 123 0 0 1
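For reproducibility, here is a minimal sketch that builds the input frame shown above. It assumes that column X holds the literal string "." and that Date is kept as plain text; neither assumption affects the dummy columns.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: X contains the literal "." and Date stays a string.
df = pd.DataFrame({
    "ID": list("AAAABBBB"),
    "Date": ["07/16/2019", "07/17/2019", "07/18/2019", "07/19/2019",
             "07/16/2019", "07/16/2019", "07/17/2019", "07/18/2019"],
    "X": ["."] * 8,
    "choice": [123, 789, 0, 456, 0, 789, 0, 123],
})
print(df)
```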
Answer 0 (score: 2)
You want get_dummies:
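import pandas as pd

# 0 means "no choice": mask it, then forward fill within each ID
new_df = (pd.get_dummies(df.choice.where(df.choice.ne(0))
                           .groupby(df['ID']).ffill()
                           .fillna(0).astype(int),
                         dtype=int)   # integer 0/1 instead of booleans
            .groupby(df['ID'])
            .shift(fill_value=0)      # lag by one row within each ID
            .add_suffix('_L')
         )
pd.concat((df, new_df), axis=1)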
Output:
ID Date X choice 0_L 123_L 456_L 789_L
0 A 07/16/2019 . 123 0 0 0 0
1 A 07/17/2019 . 789 0 1 0 0
2 A 07/18/2019 . 0 0 0 0 1
3 A 07/19/2019 . 456 0 0 0 1
4 B 07/16/2019 . 0 0 0 0 0
5 B 07/16/2019 . 789 1 0 0 0
6 B 07/17/2019 . 0 0 0 0 1
7 B 07/18/2019 . 123 0 0 0 1
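The output above carries an extra 0_L column that the expected result does not have. As a hedged sketch of one way to avoid it, the missing values can be left as NaN before calling get_dummies, so that no column is created for 0. The names last_seen, dummies and lagged are illustrative only, and the sample frame is reduced to the two columns that matter here.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list("AAAABBBB"),
    "choice": [123, 789, 0, 456, 0, 789, 0, 123],
})

# Treat 0 as "no choice", then carry the last real choice forward within each ID.
last_seen = df["choice"].where(df["choice"].ne(0)).groupby(df["ID"]).ffill()

# One indicator column per alternative; rows that are still NaN become all zeros.
dummies = pd.get_dummies(last_seen, dtype=int)
dummies.columns = [f"{int(c)}_L" for c in dummies.columns]

# Shift within each ID so every row sees the choice made before it.
lagged = dummies.groupby(df["ID"]).shift(fill_value=0)

result = pd.concat([df, lagged], axis=1)
print(result)
```

An alternative that never occurs in the data gets no column this way, so a reindex over the full list of alternatives would still be needed in that case.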
Answer 1 (score: 1)
You can first process choice to replace 0 with missing values, then shift within each ID group, and then forward fill the missing values:
# transform (rather than apply) keeps the original index so s aligns with df
s = (df['choice'].mask(df['choice'].eq(0))
                 .groupby(df['ID'])
                 .transform(lambda x: x.shift().ffill()))
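Then use get_dummies on the Series, add DataFrame.reindex so that every possible value gets a column (taken either from a list or from the original column without 0), and rename the columns with DataFrame.add_suffix: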
import numpy as np

# possible values given as a list
# vals = [123, 456, 789]
# or extract all sorted values without 0
vals = df['choice'].unique()
vals = np.sort(vals[vals != 0])
df = (df.join(pd.get_dummies(s, dtype=int)
                .reindex(columns=vals, fill_value=0)
                .add_suffix('_L')))
print(df)
ID Date X choice 123_L 456_L 789_L
0 A 07/16/2019 . 123 0 0 0
1 A 07/17/2019 . 789 1 0 0
2 A 07/18/2019 . 0 0 0 1
3 A 07/19/2019 . 456 0 0 1
4 B 07/16/2019 . 0 0 0 0
5 B 07/16/2019 . 789 0 0 0
6 B 07/17/2019 . 0 0 0 1
7 B 07/18/2019 . 123 0 0 1
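To see why the mask, shift and forward-fill sequence gives the lagged choice, it can help to look at each intermediate Series side by side. The sketch below splits the chain into separate groupby calls purely for illustration; the column names in steps are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list("AAAABBBB"),
    "choice": [123, 789, 0, 456, 0, 789, 0, 123],
})

masked = df["choice"].mask(df["choice"].eq(0))   # 0 becomes NaN
prev = masked.groupby(df["ID"]).shift()          # previous row within the same ID
s = prev.groupby(df["ID"]).ffill()               # carry the last real choice forward

# Side-by-side view of every step.
steps = pd.DataFrame({"choice": df["choice"], "masked": masked,
                      "shifted": prev, "s": s})
print(steps)
```

Rows where s is still NaN (the first row of each ID, and rows preceded only by 0) end up with all dummy columns equal to 0.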
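Answer 2 (score: 0)
Steps
- Use ffill to create a column in which 0 is replaced by the last choice made
- Use numpy broadcasting to compare the unique choices with the last choice to get the dummies
- Create a DataFrame from the dummies and concatenate it to df
Edit
- Corrected the code, which did not take ID into account
- Also added row 5 for testing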
import numpy as np
import pandas as pd

ls = [("A", "07/16/2019 ", 123),
      ("A", "07/17/2019 ", 789),
      ("A", "07/18/2019 ", 0),
      ("A", "07/19/2019 ", 456),
      ("B", "07/16/2019 ", 789),
      ("B", "07/18/2019 ", 123),
      ("B", "07/17/2019 ", 0),
      ("B", "07/18/2019 ", 123),]
df = pd.DataFrame(ls, columns=["id", "date", "choice"])
# stable sort plus a fresh index so the final concat aligns row by row
df = df.sort_values("id", kind="stable").reset_index(drop=True)
# last non-zero choice before each row; the first row of every id is reset to 0
prev_choice = df["choice"].mask(df["choice"] == 0, np.nan).ffill().shift()
prev_choice[df["id"] != df["id"].shift()] = 0
# np.unique sorts, so index 0 holds the 0 ("no choice") value
unique_choices = np.delete(np.unique(df["choice"]), 0)
# (1, n_choices) against (n_rows, 1) broadcasts to (n_rows, n_choices)
last_choice = np.equal(unique_choices[np.newaxis, :],
                       prev_choice.values[:, np.newaxis])
dummy_df = pd.DataFrame(last_choice, columns=[f"{choice}_L" for choice in unique_choices])
pd.concat([df, dummy_df], axis=1)
Result
id date choice 123_L 456_L 789_L
0 A 07/16/2019 123 False False False
1 A 07/17/2019 789 True False False
2 A 07/18/2019 0 False False True
3 A 07/19/2019 456 False False True
4 B 07/16/2019 789 False False False
5 B 07/18/2019 123 False False True
6 B 07/17/2019 0 True False False
7 B 07/18/2019 123 True False False
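One caveat with a forward fill over the whole column: if a group starts with 0, the last choice of the previous id can leak into that group's second row. The following is a sketch of a per-id variant of the same broadcasting idea, not the answer's original code. The four-row frame is hypothetical and chosen so that group B starts with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical data where group B starts with 0 (no choice yet).
df = pd.DataFrame({
    "id": ["A", "A", "B", "B"],
    "choice": [123, 0, 0, 456],
})

masked = df["choice"].mask(df["choice"] == 0)
# Forward fill and shift inside each id, so nothing crosses a group boundary.
prev_choice = (masked.groupby(df["id"]).ffill()
                     .groupby(df["id"]).shift()
                     .fillna(0))

# All alternatives that occur in the data, without the 0 placeholder.
unique_choices = np.unique(df.loc[df["choice"] != 0, "choice"])
# (n_rows, 1) compared with (1, n_choices) broadcasts to (n_rows, n_choices).
dummies = prev_choice.to_numpy()[:, np.newaxis] == unique_choices[np.newaxis, :]
dummy_df = pd.DataFrame(dummies.astype(int),
                        columns=[f"{c}_L" for c in unique_choices],
                        index=df.index)
result = pd.concat([df, dummy_df], axis=1)
print(result)
```

With this data the second row of B has all dummies equal to 0, because B has made no choice before that row.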