How do I create multiple dummy variables (interaction between two columns)?

Time: 2019-11-13 16:10:53

Tags: pandas numpy

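I need to create a dummy variable for every combination of choice and city. The choice set is a list of integers: [10, 20, 30, 40, 50], and the city set is a list of strings: ['XX', 'YY', 'ZZ'].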

Here is the DataFrame:

 choice city
     10   XX
     20   YY
     20   YY
     30   XX
     10   XX
     20   YY
     40   ZZ
     40   ZZ
     50   YY

Expected result:

 choice city  10_XX  10_YY  10_ZZ  20_XX  20_YY  20_ZZ  30_XX  30_YY  30_ZZ  40_XX  40_YY  40_ZZ  50_XX  50_YY  50_ZZ
     10   XX      1      0      0      0      0      0      0      0      0      0      0      0      0      0      0
     20   YY      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
     20   YY      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
     30   XX      0      0      0      0      0      0      1      0      0      0      0      0      0      0      0
     10   XX      1      0      0      0      0      0      0      0      0      0      0      0      0      0      0
     20   YY      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0
     40   ZZ      0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
     40   ZZ      0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
     50   YY      0      0      0      0      0      0      0      0      0      0      0      0      0      1      0

2 answers:

Answer 0: (score: 1)

You can try:

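import numpy as np

_choice = [10, 20, 30, 40, 50]
_city = ["XX", "YY", "ZZ"]

# df is the DataFrame from the question, with columns "choice" and "city".
for ch in _choice:
    for ci in _city:
        # 1 where the row matches both this choice and this city, else 0
        df[f"{ch}_{ci}"] = np.where((df["choice"] == ch) & (df["city"] == ci), 1, 0)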

And without an explicit for loop:

import itertools
import pandas as pd

_choice = [10, 20, 30, 40, 50]
_city = ["XX", "YY", "ZZ"]
opts = list(itertools.product(_choice, _city))
cols = [f"{ch}_{ci}" for ch, ci in opts]

# Build one Series of 0/1 flags per input row (apply still visits rows one by one).
df[cols] = df.apply(
    lambda x: pd.Series({f"{ch}_{ci}": int(x["choice"] == ch and x["city"] == ci) for ch, ci in opts}),
    axis=1,
)
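As a possible vectorized alternative (not part of the answer above), the two columns can be joined into a single label and one-hot encoded with pd.get_dummies. This is only a sketch: it rebuilds the question's DataFrame so it runs on its own, and the names key, all_cols, dummies, and result are illustrative.

```python
import itertools
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "choice": [10, 20, 20, 30, 10, 20, 40, 40, 50],
    "city": ["XX", "YY", "YY", "XX", "XX", "YY", "ZZ", "ZZ", "YY"],
})
choices = [10, 20, 30, 40, 50]
cities = ["XX", "YY", "ZZ"]

# One "choice_city" label per row, e.g. "10_XX".
key = df["choice"].astype(str) + "_" + df["city"]

# All 15 combinations, in the column order of the expected result.
all_cols = [f"{ch}_{ci}" for ch, ci in itertools.product(choices, cities)]

# One-hot encode the labels; reindex adds the combinations that never
# occur in the data as all-zero columns.
dummies = pd.get_dummies(key).reindex(columns=all_cols, fill_value=0).astype(int)

result = pd.concat([df, dummies], axis=1)
print(result)
```

Because each row carries exactly one label, every row of dummies contains a single 1.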

Answer 1: (score: 1)

You can use an outer comparison, which flags every pair of rows that share the same choice and city.


u = np.equal.outer(df, df).any(1).all(-1).view('i1')

array([[1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int8)
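To make the matrix above easier to read: entry u[i, j] is 1 when row i and row j have the same choice and the same city. The sketch below reproduces it with plain NumPy broadcasting, one column at a time. It is an equivalent formulation for illustration, not the answer's own expression, and it rebuilds the question's data so it runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "choice": [10, 20, 20, 30, 10, 20, 40, 40, 50],
    "city": ["XX", "YY", "YY", "XX", "XX", "YY", "ZZ", "ZZ", "YY"],
})

choice = df["choice"].to_numpy()
city = df["city"].to_numpy()

# Compare every row with every other row, column by column.
same_choice = choice[:, None] == choice[None, :]
same_city = city[:, None] == city[None, :]

# u[i, j] is 1 when rows i and j agree on both columns.
u = (same_choice & same_city).astype("i1")
print(u)
```

Rows 0 and 4 are both (10, XX), which is why columns 0 and 4 are set in the first row of the matrix.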

Now build the desired DataFrame (allc is a set, so the column order of the result is arbitrary):

index = pd.MultiIndex.from_frame(df)
columns = index.map("{0[0]}_{0[1]}".format)

allc = set(
  f'{i}_{j}' for i in df['choice'] for j in df['city'])

res = pd.DataFrame(u, index, columns).T.drop_duplicates().T

res.reindex(allc, axis=1, fill_value=0)

             40_ZZ  50_ZZ  20_YY  50_XX  40_XX  20_ZZ  20_XX  10_YY  30_ZZ  30_YY  10_XX  30_XX  50_YY  40_YY  10_ZZ
choice city
10     XX        0      0      0      0      0      0      0      0      0      0      1      0      0      0      0
20     YY        0      0      1      0      0      0      0      0      0      0      0      0      0      0      0
       YY        0      0      1      0      0      0      0      0      0      0      0      0      0      0      0
30     XX        0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
10     XX        0      0      0      0      0      0      0      0      0      0      1      0      0      0      0
20     YY        0      0      1      0      0      0      0      0      0      0      0      0      0      0      0
40     ZZ        1      0      0      0      0      0      0      0      0      0      0      0      0      0      0
       ZZ        1      0      0      0      0      0      0      0      0      0      0      0      0      0      0
50     YY        0      0      0      0      0      0      0      0      0      0      0      0      1      0      0
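If the column order of the expected result matters, one option is to reindex with an ordered list instead of a set. The following is a rough sketch under a few assumptions: it recomputes the row-equality matrix with NumPy broadcasting rather than the outer call above, takes the combinations from the sorted unique values in the data, and uses illustrative names (labels, first, ordered, out).

```python
import itertools
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "choice": [10, 20, 20, 30, 10, 20, 40, 40, 50],
    "city": ["XX", "YY", "YY", "XX", "XX", "YY", "ZZ", "ZZ", "YY"],
})

choice = df["choice"].to_numpy()
city = df["city"].to_numpy()

# Row-equality matrix: u[i, j] is 1 when rows i and j match on both columns.
u = ((choice[:, None] == choice[None, :]) &
     (city[:, None] == city[None, :])).astype("i1")

# Label each row, then keep one column of u per distinct label.
labels = df["choice"].astype(str) + "_" + df["city"]
first = ~labels.duplicated().to_numpy()
res = pd.DataFrame(u[:, first], columns=labels[first].to_numpy())

# Ordered list of every combination, so the column order is deterministic.
ordered = [
    f"{i}_{j}"
    for i, j in itertools.product(sorted(df["choice"].unique()),
                                  sorted(df["city"].unique()))
]
res = res.reindex(columns=ordered, fill_value=0)

# Put the original columns back in front, as in the expected result.
out = pd.concat([df, res], axis=1)
print(out)
```

Keeping choice and city as ordinary columns rather than a MultiIndex also matches the layout shown in the question.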