Multiplying row values across two different DataFrames

Time: 2017-11-15 07:23:00

Tags: python pandas dataframe genetic-algorithm

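I am building a genetic algorithm for feature selection in Python. I extracted features from my data and split them into two DataFrames, 'train' and 'test'. How can I multiply the values of each row of the 'population' DataFrame (each individual) with every row of the 'train' DataFrame?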

'train' DataFrame:

   feature0   feature1   feature2   feature3   feature4   feature5
0  18.279579  -3.921346  13.611829  -7.250185 -11.773605 -18.265003   
1  17.899545 -15.503942  -0.741729  -0.053619  -6.734652   4.398419   
4  16.432750 -22.490190  -4.611659 -15.247781 -13.941488  -2.433374   
5  15.905368  -4.812785  18.291712   3.742221   3.631887  -1.074326   
6  16.991823 -15.946251   8.299577   8.057511   8.057510  -1.482333

'population' DataFrame:

      0     1     2     3     4     5     
0     1     1     0     0     0     1     
1     0     1     0     1     0     0     
2     0     0     0     0     0     1     
3     0     0     1     0     1     1
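
For anyone who wants to reproduce the examples, the two frames shown above could be rebuilt roughly as follows. This is only a sketch: the values are copied from the printed tables, and the assumption is that 'population' has the default integer column labels 0 to 5 while 'train' keeps its gaps in the row index (0, 1, 4, 5, 6).

```python
import pandas as pd

# Values copied from the printed 'train' table; note the non-contiguous index.
train = pd.DataFrame(
    [
        [18.279579, -3.921346, 13.611829, -7.250185, -11.773605, -18.265003],
        [17.899545, -15.503942, -0.741729, -0.053619, -6.734652, 4.398419],
        [16.432750, -22.490190, -4.611659, -15.247781, -13.941488, -2.433374],
        [15.905368, -4.812785, 18.291712, 3.742221, 3.631887, -1.074326],
        [16.991823, -15.946251, 8.299577, 8.057511, 8.057510, -1.482333],
    ],
    index=[0, 1, 4, 5, 6],
    columns=[f"feature{i}" for i in range(6)],
)

# Each row is one individual: a 0/1 mask over the six features.
# Columns are left as the default integers 0..5 (an assumption based on the printout).
population = pd.DataFrame(
    [
        [1, 1, 0, 0, 0, 1],
        [0, 1, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 1],
    ]
)

print(train)
print(population)
```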

Each row of 'population' should be multiplied by all rows of 'train'. The result should look like this:

1) From row 1 of population:

   feature0   feature1   feature2   feature3   feature4   feature5
0  18.279579  -3.921346          0          0          0 -18.265003   
1  17.899545 -15.503942          0          0          0   4.398419   
4  16.432750 -22.490190          0          0          0  -2.433374   
5  15.905368  -4.812785          0          0          0  -1.074326   
6  16.991823 -15.946251          0          0          0  -1.482333

2) From row 2 of population:

   feature0   feature1   feature2   feature3   feature4   feature5
0          0  -3.921346          0  -7.250185          0          0
1          0 -15.503942          0  -0.053619          0          0   
4          0 -22.490190          0 -15.247781          0          0   
5          0  -4.812785          0   3.742221          0          0   
6          0 -15.946251          0   8.057511          0          0

And so on...

3 answers:

Answer 0: (score: 4)

If you need a loop (slow for large data):

# x.values is a plain NumPy array, so the multiplication is positional
for i, x in population.iterrows():
    print(train * x.values)

    feature0   feature1  feature2  feature3  feature4   feature5
0  18.279579  -3.921346       0.0      -0.0      -0.0 -18.265003
1  17.899545 -15.503942      -0.0      -0.0      -0.0   4.398419
4  16.432750 -22.490190      -0.0      -0.0      -0.0  -2.433374
5  15.905368  -4.812785       0.0       0.0       0.0  -1.074326
6  16.991823 -15.946251       0.0       0.0       0.0  -1.482333
   feature0   feature1  feature2   feature3  feature4  feature5
0       0.0  -3.921346       0.0  -7.250185      -0.0      -0.0
1       0.0 -15.503942      -0.0  -0.053619      -0.0       0.0
4       0.0 -22.490190      -0.0 -15.247781      -0.0      -0.0
5       0.0  -4.812785       0.0   3.742221       0.0      -0.0
6       0.0 -15.946251       0.0   8.057511       0.0      -0.0
   feature0  feature1  feature2  feature3  feature4   feature5
0       0.0      -0.0       0.0      -0.0      -0.0 -18.265003
1       0.0      -0.0      -0.0      -0.0      -0.0   4.398419
4       0.0      -0.0      -0.0      -0.0      -0.0  -2.433374
5       0.0      -0.0       0.0       0.0       0.0  -1.074326
6       0.0      -0.0       0.0       0.0       0.0  -1.482333
   feature0  feature1   feature2  feature3   feature4   feature5
0       0.0      -0.0  13.611829      -0.0 -11.773605 -18.265003
1       0.0      -0.0  -0.741729      -0.0  -6.734652   4.398419
4       0.0      -0.0  -4.611659      -0.0 -13.941488  -2.433374
5       0.0      -0.0  18.291712       0.0   3.631887  -1.074326
6       0.0      -0.0   8.299577       0.0   8.057510  -1.482333
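
Two details of the loop above may be worth spelling out: why `.values` is used, and where the `-0.0` entries come from. The sketch below uses reduced toy versions of the frames (an assumption made for brevity) and assumes 'population' still has integer column labels. Multiplying by the Series itself aligns on labels, which do not match, while multiplying by the array is positional. A negative number times zero gives `-0.0` in floating point, and adding `0.0` is one way to normalise the sign.

```python
import numpy as np
import pandas as pd

# Reduced toy versions of the frames from the question (assumed for brevity).
train = pd.DataFrame(
    [[18.279579, -3.921346, 13.611829],
     [17.899545, -15.503942, -0.741729]],
    index=[0, 1],
    columns=["feature0", "feature1", "feature2"],
)
population = pd.DataFrame([[1, 1, 0], [0, 1, 0]])  # column labels are 0, 1, 2

row = population.iloc[0]  # Series indexed by 0, 1, 2

# Series index (0, 1, 2) is aligned with train's column labels (feature0, ...).
# Nothing matches, so every cell of the result is NaN.
aligned = train * row

# A bare NumPy array carries no labels, so this is purely positional.
masked = train * row.values

# negative * 0 yields -0.0 (IEEE 754); adding 0.0 turns it into +0.0.
clean = masked + 0.0

print(aligned)
print(masked)
print(clean)
```

The `-0.0` values compare equal to `0.0`, so the normalisation only matters for display.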

Or each row separately:

print(train * population.values[0])

    feature0   feature1  feature2  feature3  feature4   feature5
0  18.279579  -3.921346       0.0      -0.0      -0.0 -18.265003
1  17.899545 -15.503942      -0.0      -0.0      -0.0   4.398419
4  16.432750 -22.490190      -0.0      -0.0      -0.0  -2.433374
5  15.905368  -4.812785       0.0       0.0       0.0  -1.074326
6  16.991823 -15.946251       0.0       0.0       0.0  -1.482333

Or, to get a MultiIndex DataFrame:

# one masked copy of train per individual, keyed by the population index
d = pd.concat([train * population.values[i] for i in range(population.shape[0])],
               keys=population.index.tolist())
print(d)

      feature0   feature1   feature2   feature3   feature4   feature5
0 0  18.279579  -3.921346   0.000000  -0.000000  -0.000000 -18.265003
  1  17.899545 -15.503942  -0.000000  -0.000000  -0.000000   4.398419
  4  16.432750 -22.490190  -0.000000  -0.000000  -0.000000  -2.433374
  5  15.905368  -4.812785   0.000000   0.000000   0.000000  -1.074326
  6  16.991823 -15.946251   0.000000   0.000000   0.000000  -1.482333
1 0   0.000000  -3.921346   0.000000  -7.250185  -0.000000  -0.000000
  1   0.000000 -15.503942  -0.000000  -0.053619  -0.000000   0.000000
  4   0.000000 -22.490190  -0.000000 -15.247781  -0.000000  -0.000000
  5   0.000000  -4.812785   0.000000   3.742221   0.000000  -0.000000
  6   0.000000 -15.946251   0.000000   8.057511   0.000000  -0.000000
2 0   0.000000  -0.000000   0.000000  -0.000000  -0.000000 -18.265003
  1   0.000000  -0.000000  -0.000000  -0.000000  -0.000000   4.398419
  4   0.000000  -0.000000  -0.000000  -0.000000  -0.000000  -2.433374
  5   0.000000  -0.000000   0.000000   0.000000   0.000000  -1.074326
  6   0.000000  -0.000000   0.000000   0.000000   0.000000  -1.482333
3 0   0.000000  -0.000000  13.611829  -0.000000 -11.773605 -18.265003
  1   0.000000  -0.000000  -0.741729  -0.000000  -6.734652   4.398419
  4   0.000000  -0.000000  -4.611659  -0.000000 -13.941488  -2.433374
  5   0.000000  -0.000000  18.291712   0.000000   3.631887  -1.074326
  6   0.000000  -0.000000   8.299577   0.000000   8.057510  -1.482333

And select with xs:

print(d.xs(0))

    feature0   feature1  feature2  feature3  feature4   feature5
0  18.279579  -3.921346       0.0      -0.0      -0.0 -18.265003
1  17.899545 -15.503942      -0.0      -0.0      -0.0   4.398419
4  16.432750 -22.490190      -0.0      -0.0      -0.0  -2.433374
5  15.905368  -4.812785       0.0       0.0       0.0  -1.074326
6  16.991823 -15.946251       0.0       0.0       0.0  -1.482333
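
As a small illustration of what the outer and inner index levels give you, the sketch below rebuilds `d` on reduced toy frames (assumed values, not the full data) and selects along each level. `d.xs(0)` and `d.loc[0]` both return the block for the first individual, while `d.xs(4, level=1)` returns train row 4 under every individual.

```python
import pandas as pd

# Reduced toy versions of the frames from the question (assumed for brevity).
train = pd.DataFrame(
    [[18.279579, -3.921346, 13.611829],
     [17.899545, -15.503942, -0.741729],
     [16.432750, -22.490190, -4.611659]],
    index=[0, 1, 4],
    columns=["feature0", "feature1", "feature2"],
)
population = pd.DataFrame([[1, 1, 0], [0, 1, 0]])

# Same construction as in the answer: outer level = individual, inner = train row.
d = pd.concat(
    [train * population.values[i] for i in range(population.shape[0])],
    keys=population.index.tolist(),
)

first_individual = d.xs(0)          # block for individual 0
same_block = d.loc[0]               # equivalent selection on the outer level
row_4_everywhere = d.xs(4, level=1) # train row 4 under each individual

print(first_individual)
print(row_4_everywhere)
```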

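Answer 1: (score: 2)

I'd use numpy broadcasting to do it all in one go...

# `pop` here appears to be the 'population' DataFrame from the question
train_ = pd.DataFrame(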
    (train.values * pop.values[:, None]).reshape(-1, train.shape[1]),
    pd.MultiIndex.from_product([pop.index, train.index]),
    train.columns
)

train_

      feature0   feature1   feature2   feature3   feature4   feature5
0 0  18.279579  -3.921346   0.000000  -0.000000  -0.000000 -18.265003
  1  17.899545 -15.503942  -0.000000  -0.000000  -0.000000   4.398419
  4  16.432750 -22.490190  -0.000000  -0.000000  -0.000000  -2.433374
  5  15.905368  -4.812785   0.000000   0.000000   0.000000  -1.074326
  6  16.991823 -15.946251   0.000000   0.000000   0.000000  -1.482333
1 0   0.000000  -3.921346   0.000000  -7.250185  -0.000000  -0.000000
  1   0.000000 -15.503942  -0.000000  -0.053619  -0.000000   0.000000
  4   0.000000 -22.490190  -0.000000 -15.247781  -0.000000  -0.000000
  5   0.000000  -4.812785   0.000000   3.742221   0.000000  -0.000000
  6   0.000000 -15.946251   0.000000   8.057511   0.000000  -0.000000
2 0   0.000000  -0.000000   0.000000  -0.000000  -0.000000 -18.265003
  1   0.000000  -0.000000  -0.000000  -0.000000  -0.000000   4.398419
  4   0.000000  -0.000000  -0.000000  -0.000000  -0.000000  -2.433374
  5   0.000000  -0.000000   0.000000   0.000000   0.000000  -1.074326
  6   0.000000  -0.000000   0.000000   0.000000   0.000000  -1.482333
3 0   0.000000  -0.000000  13.611829  -0.000000 -11.773605 -18.265003
  1   0.000000  -0.000000  -0.741729  -0.000000  -6.734652   4.398419
  4   0.000000  -0.000000  -4.611659  -0.000000 -13.941488  -2.433374
  5   0.000000  -0.000000  18.291712   0.000000   3.631887  -1.074326
  6   0.000000  -0.000000   8.299577   0.000000   8.057510  -1.482333

You can access just the part corresponding to the i-th row of population with train_.loc[i]

train_.loc[3]

   feature0  feature1   feature2  feature3   feature4   feature5
0       0.0      -0.0  13.611829      -0.0 -11.773605 -18.265003
1       0.0      -0.0  -0.741729      -0.0  -6.734652   4.398419
4       0.0      -0.0  -4.611659      -0.0 -13.941488  -2.433374
5       0.0      -0.0  18.291712       0.0   3.631887  -1.074326
6       0.0      -0.0   8.299577       0.0   8.057510  -1.482333
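
The broadcasting step in this answer can be unpacked by looking at the array shapes. The following is a sketch on reduced toy frames (assumed values); the intermediate names `a`, `b`, `cube` and `flat` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Reduced toy versions of the frames (assumed for brevity).
train = pd.DataFrame(
    [[18.279579, -3.921346, 13.611829],
     [17.899545, -15.503942, -0.741729],
     [16.432750, -22.490190, -4.611659]],
    index=[0, 1, 4],
    columns=["feature0", "feature1", "feature2"],
)
pop = pd.DataFrame([[1, 1, 0], [0, 1, 0]])

a = train.values             # shape (3, 3): (train rows, features)
b = pop.values[:, None]      # shape (2, 1, 3): (individuals, 1, features)

# Broadcasting stretches the length-1 axis of b over the train rows,
# giving one masked copy of train per individual.
cube = a * b                 # shape (2, 3, 3)

# Stack the per-individual blocks on top of each other.
flat = cube.reshape(-1, train.shape[1])   # shape (6, 3)

train_ = pd.DataFrame(
    flat,
    index=pd.MultiIndex.from_product([pop.index, train.index]),
    columns=train.columns,
)

print(b.shape, cube.shape, flat.shape)
print(train_.loc[1])
```

Because `reshape` flattens the first two axes in row-major order, the rows come out grouped by individual, which is the same order `MultiIndex.from_product` generates.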

Rough timing test
I'm too lazy to do a more robust test

%%timeit
pd.DataFrame(
    (train.values * pop.values[:, None]).reshape(-1, train.shape[1]),
    pd.MultiIndex.from_product([pop.index, train.index]),
    train.columns
)

%%timeit
res = pop.iloc[np.repeat(np.arange(len(pop)), len(train))]
res = res.set_index(np.tile(train.index, len(pop)), append=True).add_prefix('feature')
res.mul(train, level=1)

%%timeit
pd.concat([train * pop.values[i] for i in range(pop.shape[0])],
          keys=pop.index.tolist())

571 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.42 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.7 ms ± 69.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
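
The `%%timeit` cells above only run inside IPython or Jupyter. A comparable rough measurement in a plain script could use the standard `timeit` module, as in the sketch below. It runs on reduced toy frames (assumed values) and covers only two of the three approaches, so absolute numbers will differ from those quoted above; the function names `broadcast` and `concat_loop` are made up for this example.

```python
import timeit

import numpy as np
import pandas as pd

# Reduced toy versions of the frames (assumed for brevity).
train = pd.DataFrame(
    [[18.279579, -3.921346, 13.611829],
     [17.899545, -15.503942, -0.741729],
     [16.432750, -22.490190, -4.611659]],
    index=[0, 1, 4],
    columns=["feature0", "feature1", "feature2"],
)
pop = pd.DataFrame([[1, 1, 0], [0, 1, 0]])


def broadcast():
    """NumPy broadcasting, then one DataFrame construction."""
    return pd.DataFrame(
        (train.values * pop.values[:, None]).reshape(-1, train.shape[1]),
        index=pd.MultiIndex.from_product([pop.index, train.index]),
        columns=train.columns,
    )


def concat_loop():
    """One masked frame per individual, glued together with concat."""
    return pd.concat(
        [train * pop.values[i] for i in range(pop.shape[0])],
        keys=pop.index.tolist(),
    )


timings = {}
for name, fn in [("broadcast", broadcast), ("concat", concat_loop)]:
    runs = timeit.repeat(fn, number=100, repeat=3)
    timings[name] = min(runs) / 100  # best observed time per call, in seconds
    print(f"{name}: {timings[name] * 1e6:.0f} µs per call")
```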

Answer 2: (score: 1)

Once population's columns are set to match train's, you can multiply with *:

In [11]: population.columns = train.columns

In [12]: train * population.iloc[0]
Out[12]:
    feature0   feature1  feature2  feature3  feature4   feature5
0  18.279579  -3.921346       0.0      -0.0      -0.0 -18.265003
1  17.899545 -15.503942      -0.0      -0.0      -0.0   4.398419
4  16.432750 -22.490190      -0.0      -0.0      -0.0  -2.433374
5  15.905368  -4.812785       0.0       0.0       0.0  -1.074326
6  16.991823 -15.946251       0.0       0.0       0.0  -1.482333
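
To illustrate what the label-based version does, here is a sketch on reduced toy frames (assumed values). It works on a copy named `mask` (a hypothetical name) rather than overwriting `population.columns`, and compares the result with the positional `.values` approach. One possible advantage of matching labels is that the result no longer depends on column order.

```python
import pandas as pd

# Reduced toy versions of the frames (assumed for brevity).
train = pd.DataFrame(
    [[18.279579, -3.921346, 13.611829],
     [17.899545, -15.503942, -0.741729]],
    index=[0, 1],
    columns=["feature0", "feature1", "feature2"],
)
population = pd.DataFrame([[1, 1, 0], [0, 1, 0]])

# Relabel a copy so the original population frame is left untouched.
mask = population.copy()
mask.columns = train.columns

by_label = train * mask.iloc[0]                        # Series index matches the columns
by_method = train.mul(mask.iloc[0], axis="columns")    # explicit spelling of the same thing
positional = train * population.values[0]              # array-based, order dependent

# With labels, a reordered train still gets the right mask per feature.
shuffled = train[train.columns[::-1]]
from_shuffled = (shuffled * mask.iloc[0])[train.columns]

print(by_label)
```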

You can build the MultiIndex quite efficiently with np.tile and np.repeat (as recommended by @jezrael):

In [11]: res = population.iloc[np.repeat(np.arange(len(population)), len(train))]

In [12]: res = res.set_index(np.tile(train.index, len(population)), append=True)

In [13]: res
Out[13]:
     feature0  feature1  feature2  feature3  feature4  feature5
0 0         1         1         0         0         0         1
  1         1         1         0         0         0         1
  4         1         1         0         0         0         1
  5         1         1         0         0         0         1
  6         1         1         0         0         0         1
1 0         0         1         0         1         0         0
  1         0         1         0         1         0         0
  4         0         1         0         1         0         0
  5         0         1         0         1         0         0
  6         0         1         0         1         0         0
2 0         0         0         0         0         0         1
  1         0         0         0         0         0         1
  4         0         0         0         0         0         1
  5         0         0         0         0         0         1
  6         0         0         0         0         0         1
3 0         0         0         1         0         1         1
  1         0         0         1         0         1         1
  4         0         0         1         0         1         1
  5         0         0         1         0         1         1
  6         0         0         1         0         1         1

In [14]: res.mul(train, level=1)
Out[14]:
      feature0   feature1   feature2   feature3   feature4   feature5
0 0  18.279579  -3.921346   0.000000  -0.000000  -0.000000 -18.265003
  1  17.899545 -15.503942  -0.000000  -0.000000  -0.000000   4.398419
  4  16.432750 -22.490190  -0.000000  -0.000000  -0.000000  -2.433374
  5  15.905368  -4.812785   0.000000   0.000000   0.000000  -1.074326
  6  16.991823 -15.946251   0.000000   0.000000   0.000000  -1.482333
1 0   0.000000  -3.921346   0.000000  -7.250185  -0.000000  -0.000000
  1   0.000000 -15.503942  -0.000000  -0.053619  -0.000000   0.000000
  4   0.000000 -22.490190  -0.000000 -15.247781  -0.000000  -0.000000
  5   0.000000  -4.812785   0.000000   3.742221   0.000000  -0.000000
  6   0.000000 -15.946251   0.000000   8.057511   0.000000  -0.000000
2 0   0.000000  -0.000000   0.000000  -0.000000  -0.000000 -18.265003
  1   0.000000  -0.000000  -0.000000  -0.000000  -0.000000   4.398419
  4   0.000000  -0.000000  -0.000000  -0.000000  -0.000000  -2.433374
  5   0.000000  -0.000000   0.000000   0.000000   0.000000  -1.074326
  6   0.000000  -0.000000   0.000000   0.000000   0.000000  -1.482333
3 0   0.000000  -0.000000  13.611829  -0.000000 -11.773605 -18.265003
  1   0.000000  -0.000000  -0.741729  -0.000000  -6.734652   4.398419
  4   0.000000  -0.000000  -4.611659  -0.000000 -13.941488  -2.433374
  5   0.000000  -0.000000  18.291712   0.000000   3.631887  -1.074326
  6   0.000000  -0.000000   8.299577   0.000000   8.057510  -1.482333
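
The same repeat/tile idea can also be applied directly to the underlying arrays, which avoids relying on index alignment altogether. This is a variation sketched on reduced toy frames (assumed values), not the code from the answer; the names `rep`, `til` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Reduced toy versions of the frames (assumed for brevity).
train = pd.DataFrame(
    [[18.279579, -3.921346, 13.611829],
     [17.899545, -15.503942, -0.741729],
     [16.432750, -22.490190, -4.611659]],
    index=[0, 1, 4],
    columns=["feature0", "feature1", "feature2"],
)
population = pd.DataFrame([[1, 1, 0], [0, 1, 0]])

n_train, n_pop = len(train), len(population)

# Each individual's mask repeated once per train row: rows 0,0,0,1,1,1.
rep = np.repeat(population.values, n_train, axis=0)

# The whole train block stacked once per individual: rows 0,1,4,0,1,4.
til = np.tile(train.values, (n_pop, 1))

# Build the two index levels with the same repeat/tile pattern.
index = pd.MultiIndex.from_arrays(
    [
        np.repeat(population.index.values, n_train),
        np.tile(train.index.values, n_pop),
    ]
)

result = pd.DataFrame(rep * til, index=index, columns=train.columns)
print(result)
```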