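Given a table with 4 columns, we create dummy columns from one of them, subject to the following condition:
The value of each newly created variable (dummy variable) is the count of rows that share the same unique combination of the remaining column values and that dummy variable.
The attached tables (source table and target table) should help in understanding the problem.
I tried the code below to generate the attached target table from the attached source table (the actual example has been modified into a test case). It works, but because the real data has millions of records, the code runs for a very long time. Is there a faster way to achieve this?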
# df_d[(df_d['Type1'] == "X11") & (df_d['Type2'] == "X1") & (df_d['Type3'] == "Y1") & (df_d["Action"] == action)].shape[0]
The statement above takes a long time to run. Any suggestions for a faster approach would be helpful.
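def find_number(k, action):
    # Count the rows matching one fixed (Type1, Type2, Type3) combination and the given action.
    # Note: k is not used in this simplified test-case version.
    return df_d[(df_d['Type1'] == "X11") & (df_d['Type2'] == "X1") & (df_d['Type3'] == "Y1") & (df_d["Action"] == action)].shape[0]
vals = df_d["Action"].value_counts().keys()
for i in vals:
    ### code to call find_number with each action value
    pass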
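One possible way to avoid re-filtering the whole frame on every call is to count all combinations once with `groupby(...).size()` and then look the counts up. The sketch below is only an illustration under assumptions: the small `df_d` is a hypothetical stand-in using the column names from `find_number` (including the capitalized `Action`), and `find_number_fast` is a made-up helper name, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for df_d, using the column names from find_number.
df_d = pd.DataFrame({
    "Type1": ["X11", "X11", "X11", "X12"],
    "Type2": ["X1", "X1", "X2", "X1"],
    "Type3": ["Y1", "Y1", "Y1", "Y1"],
    "Action": ["A", "A", "B", "A"],
})

# Single pass over the data: row count for every
# (Type1, Type2, Type3, Action) combination that occurs.
# to_dict() yields plain tuple keys, e.g. ("X11", "X1", "Y1", "A").
counts = df_d.groupby(["Type1", "Type2", "Type3", "Action"]).size().to_dict()

def find_number_fast(type1, type2, type3, action):
    # Dictionary lookup instead of a full boolean-mask scan;
    # combinations that never occur are reported as 0.
    return int(counts.get((type1, type2, type3, action), 0))
```

Each lookup is then a dictionary access rather than a scan of millions of rows, so the cost of the repeated calls should no longer grow with the size of the table.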
Code to create the source table:
import pandas as pd
df1 = pd.DataFrame({
    "Type1": ["x11", "x11", "x11", "x12", "x12", "x12", "x12", "x12", "x12"],
    "Type2": ["x1", "x2", "x2", "x2", "x1", "x1", "x2", "x1", "x1"],
    "Type3": ["y1", "y2", "y3", "y1", "y2", "y2", "y2", "y2", "y1"],
    "action": ["A", "A", "A", "B", "B", "B", "A", "A", "A"],
})
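As a rough sketch of how the whole target table might be built in one vectorized step, `pd.crosstab` can count rows per combination and spread the `action` values into columns. This assumes the target table has one row per unique (`Type1`, `Type2`, `Type3`) combination and one count column per `action` value; the attached target table is not reproduced here, so that layout is an assumption, and the name `target` is chosen only for illustration.

```python
import pandas as pd

# Source table, as defined in the question.
df1 = pd.DataFrame({
    "Type1": ["x11", "x11", "x11", "x12", "x12", "x12", "x12", "x12", "x12"],
    "Type2": ["x1", "x2", "x2", "x2", "x1", "x1", "x2", "x1", "x1"],
    "Type3": ["y1", "y2", "y3", "y1", "y2", "y2", "y2", "y2", "y1"],
    "action": ["A", "A", "A", "B", "B", "B", "A", "A", "A"],
})

# Rows: unique (Type1, Type2, Type3) combinations.
# Columns: one dummy column per action value, holding the row count.
target = pd.crosstab(
    [df1["Type1"], df1["Type2"], df1["Type3"]],
    df1["action"],
).reset_index()
target.columns.name = None  # drop the leftover "action" label on the column axis

print(target)
```

Combinations that never occur with a given action get a count of 0. On very large data, `df1.groupby(["Type1", "Type2", "Type3", "action"]).size().unstack(fill_value=0)` is an equivalent formulation that may be worth timing against `crosstab`.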