Vectorizing the multiplication of multiple dataframes based on column combinations

Time: 2018-03-15 14:59:15

Tags: python pandas numpy

Suppose we set up the dataframes as follows:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 2, (10, 2)), columns=['Cow', 'Sheep'])
df2 = pd.DataFrame(np.random.randint(0, 2, (10, 5)), columns=['Hungry', 'Scared', 'Happy', 'Bored', 'Sad'])
df3 = pd.DataFrame(np.random.randint(0, 2, (10, 2)), columns=['Davids', 'Michaels'])
df1.index.name = df2.index.name = df3.index.name = 'id'

combos_to_test = pd.DataFrame([('Davids', 'Cow', 'Hungry'),
                               ('Michaels', 'Cow', 'Hungry'),
                               ('Davids', 'Cow', 'Scared'),
                               ('Michaels', 'Cow', 'Scared'),
                               ('Michaels', 'Sheep', 'Scared'),
                               ('Davids', 'Sheep', 'Happy'),
                               ('Michaels', 'Sheep', 'Happy'),])

Example:

   DF1:           DF2:                                               DF3:
   id Cow Sheep    id   Hungry  Scared  Happy   Bored   Sad          id    Davids   Michaels            
    0   0   1       0     0       1        1       0    1            0      1          0  
    1   0   0       1     1       0        0       1    1            1      0          1  
    2   0   0       2     1       0        0       1    1            2      0          0  
    3   1   0       3     0       0        1       0    1            3      0          1  
    4   1   0       4     0       0        1       1    0            4      0          1  
    5   1   1       5     0       0        1       1    0            5      1          0  
    6   1   1       6     1       0        1       1    0            6      1          0  
    7   1   0       7     1       1        1       1    0            7      1          1  
    8   1   1       8     1       1        1       1    0            8      1          0  
    9   1   0       9     0       1        1       0    0            9      1          0    

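I need a fourth dataframe with each entry of combos_to_test as a column, holding (for each combination) the row-wise product of the three named columns.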

My plan for doing this is to change the columns to:

df1.columns = Cow, Cow, Cow, Cow, Sheep, Sheep, Sheep
df2.columns = Hungry, Hungry, Scared, Scared, Scared, Happy, Happy
df3.columns = Davids, Michaels, Davids, Michaels, Michaels, Davids, Michaels

Then rename all the columns to col1, col2, col3, ..., col7.

Then multiply the dataframes together (this vectorizes the computation, but needs a lot of memory).
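
The plan above can be sketched directly in NumPy without renaming anything. This is only an illustration under some assumptions: the three frames share the same index in the same order, the fixed seed and the names `combos`, `names`, `animals`, `moods`, `product` and `result` are made up for the example, and selecting with a list of repeated labels is used in place of reassigning `.columns`.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed for reproducibility (an assumption, not in the question)
df1 = pd.DataFrame(np.random.randint(0, 2, (10, 2)), columns=['Cow', 'Sheep'])
df2 = pd.DataFrame(np.random.randint(0, 2, (10, 5)),
                   columns=['Hungry', 'Scared', 'Happy', 'Bored', 'Sad'])
df3 = pd.DataFrame(np.random.randint(0, 2, (10, 2)), columns=['Davids', 'Michaels'])

combos = [('Davids', 'Cow', 'Hungry'), ('Michaels', 'Cow', 'Hungry'),
          ('Davids', 'Cow', 'Scared'), ('Michaels', 'Cow', 'Scared'),
          ('Michaels', 'Sheep', 'Scared'), ('Davids', 'Sheep', 'Happy'),
          ('Michaels', 'Sheep', 'Happy')]

# Split the combinations into one list of labels per source frame.
names, animals, moods = (list(labels) for labels in zip(*combos))

# Selecting with a list that repeats labels repeats the columns, so each
# selection has shape (n_rows, n_combos) and the columns line up by position.
product = (df3[names].to_numpy()
           * df1[animals].to_numpy()
           * df2[moods].to_numpy())

# Label each output column with the combination it was built from.
result = pd.DataFrame(product, index=df1.index,
                      columns=pd.MultiIndex.from_tuples(combos))
```

Each of the three intermediate arrays holds n_rows x n_combos values, which is where the memory cost comes from; processing `combos` in smaller slices would be one way to bound it.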

My real dataset is obviously much larger, and I will be using numpy / pandas.

The output df should look like this:

  ('Davids', 'Cow', 'Hungry') | ('Michaels', 'Cow', 'Hungry') | ('Davids', 'Cow', 'Scared') | ('Michaels', 'Cow', 'Scared') | ...
 1)         0                               1                             0                                 0
 2)         0                               0                             0                                 0
 3)         0                               1                             0                                 0
 4)         0                               0                             1                                 0
 5)         0                               0                             0                                 0
 6)         0                               0                             0                                 0
 7)         0                               0                             0                                 0
 8)         0                               0                             1                                 1
 9)         1                               0                             0                                 0
10)         1                               0                             0                                 0

2 answers:

Answer 0 (score: 3)

I would do this with pd.concat:

df = pd.concat([df1, df2, df3], axis=1)

pd.concat({
    ctt: df.reindex(columns=ctt).prod(1)
    for ctt in map(tuple, combos_to_test.values)
}, axis=1)

   Davids              Michaels                    
      Cow        Sheep      Cow        Sheep       
   Hungry Scared Happy   Hungry Scared Happy Scared
id                                                 
0       0      0     0        0      0     0      0
1       1      1     0        1      1     0      1
2       0      0     0        0      0     0      0
3       0      0     0        0      0     1      0
4       1      1     0        0      0     0      0
5       0      0     0        1      1     1      1
6       0      0     0        0      0     0      0
7       0      0     0        0      0     0      0
8       0      0     0        0      0     0      0
9       0      0     0        0      0     0      0
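
To see what the per-combination product does, here is a small hand-checkable sketch. The data is made up for illustration, and it selects columns with `df[list(combo)]` rather than `reindex`, which is a variant of the code above rather than the same call.

```python
import pandas as pd

# Tiny made-up frame standing in for the concatenated df1/df2/df3.
df = pd.DataFrame({
    'Cow':      [1, 1, 0],
    'Hungry':   [1, 1, 1],
    'Davids':   [1, 0, 1],
    'Michaels': [0, 1, 1],
})

combos = [('Davids', 'Cow', 'Hungry'), ('Michaels', 'Cow', 'Hungry')]

# For each combination, pick its three columns and multiply them row by row;
# the tuple keys become the (multi-level) column labels of the result.
result = pd.concat(
    {combo: df[list(combo)].prod(axis=1) for combo in combos},
    axis=1,
)

print(result.to_numpy().tolist())  # [[1, 0], [0, 1], [0, 0]]
```

A row gets 1 only when all three columns of the combination are 1 in that row, so for 0/1 data the product behaves like a logical AND.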

Answer 1 (score: 0)

The simplest way to duplicate a column is:

df1['Cow_copy'] = df1['Cow']

If you want to duplicate many columns, you can build a list of the column names, loop over it, and apply the code above to each one.
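
A minimal sketch of that loop, assuming a small made-up df1 and a hypothetical `_copy` suffix for the new column names:

```python
import pandas as pd

# Small made-up frame for illustration.
df1 = pd.DataFrame({'Cow': [0, 1, 1], 'Sheep': [1, 0, 1]})

columns_to_copy = ['Cow', 'Sheep']  # hypothetical list of columns to duplicate

# Add one copy of each listed column under a new, unique name.
for col in columns_to_copy:
    df1[col + '_copy'] = df1[col]

print(list(df1.columns))  # ['Cow', 'Sheep', 'Cow_copy', 'Sheep_copy']
```

Note that this only duplicates columns; the per-combination multiplication still has to be done afterwards.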