Multiple binary columns to one column

Time: 2017-05-16 11:19:48

Tags: python pandas numpy scikit-learn

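I have a dataset in a CSV file with 21 columns. The first 10 columns are numeric and I don't want to change them. The next 10 columns are binary data containing only 1s and 0s: exactly one of them is "1" and the others are "0". The last column is the given label.

A sample row looks like this: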

2596,51,3,258,0,510,221,232,148,6279,24(10th column),0,0,0,0,0,1(16th column),0,0,0,0,2(the last column)

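Assuming I load the data into a matrix, can I keep the first 10 columns and the last column unchanged and convert the middle 10 columns into a single column? After the conversion, I want the value of that column to be the index of the "1". For a row like the one above, the desired result is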

2596,51,3,258,0,510,221,232,148,6279,24,6(it's 6 because the "1" is on 6th column of the binary data),2 #12 columns in total

Can I achieve this with NumPy, scikit-learn, or some other way?

4 Answers:

Answer 0 (score: 2)

If you load it into a NumPy array named data, this should do it:

out = np.c_[data[:, :11], np.where(data[:, 11:-1])[1] + 1, data[:, -1]]
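A minimal runnable sketch of the one-liner in this answer. The array is called `data` here, since `in` is a reserved word in Python, and the second row is made up for illustration. It assumes every row has exactly one 1 in the binary block; otherwise `np.where` would not return one position per row.

```python
import numpy as np

# Two rows: 11 numeric columns, 10 one-hot columns, 1 label column.
# The first row is the sample from the question; the second is made up.
data = np.array([
    [2596, 51, 3, 258, 0, 510, 221, 232, 148, 6279, 24,
     0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
     0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
])

# np.where on a 2-D array returns (row_indices, column_indices) of the
# nonzero entries in row-major order. With exactly one 1 per row, the
# column indices line up with the rows; +1 makes them 1-based.
position = np.where(data[:, 11:-1])[1] + 1

# np.c_ stacks the pieces column-wise, turning the 1-D arrays into columns.
out = np.c_[data[:, :11], position, data[:, -1]]
print(out)
```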

Answer 1 (score: 1)

from io import StringIO

import pandas as pd

csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
               "\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")

df = pd.read_csv(csv, header=None)

df = pd.concat(objs=[df[df.columns[:11]],
                     df[df.columns[11:-1]].idxmax(axis=1) - 10,
                     df[df.columns[-1]]], axis=1)

print(df)

Output:

     0   1   2    3   4    5    6    7    8     9   10  0   21
0  2596  51   3  258   0  510  221  232  148  6279  24   6   2
1     1   2   3    4   5    6    7    8    9    10  11   5   1

Answer 2 (score: 0)

Data:

In [135]: df
Out[135]:
     0   1   2    3   4    5    6    7    8     9  ...  12  13  14  15  16  17  18  19  20  21
0  2596  51   3  258   0  510  221  232  148  6279 ...   0   0   0   0   1   0   0   0   0   2
1  2596  51   3  258   0  510  221  232  148  6279 ...   0   0   0   0   0   0   0   0   1   2

[2 rows x 22 columns]

Solution:

df = pd.read_csv('/path/to/file.csv', header=None)

In [137]: df.iloc[:, :11] \
            .join(df.iloc[:, 11:21].dot(range(1,11)).to_frame(11)) \
            .join(df.iloc[:, -1])
Out[137]:
     0   1   2    3   4    5    6    7    8     9   10  11  21
0  2596  51   3  258   0  510  221  232  148  6279  24   6   2
1  2596  51   3  258   0  510  221  232  148  6279  24  10   2
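Why the dot product with `range(1, 11)` works: each row of the binary block is one-hot, so multiplying by the weights 1..10 and summing leaves only the weight at the position of the 1. A small sketch with just the binary block from the two rows above. The names `binary`, `weights` and `position` are illustrative, and the explicit check is an addition, since a row of all zeros would silently give 0 and a row with several 1s would give the sum of their positions.

```python
import numpy as np
import pandas as pd

# Only the 10 binary columns of the two rows shown in this answer.
binary = pd.DataFrame([
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
])

# Guard the one-hot assumption before collapsing the block.
if not (binary.sum(axis=1) == 1).all():
    raise ValueError("each row must contain exactly one 1")

# Row-wise weighted sum: only the weight under the single 1 survives.
weights = np.arange(1, 11)
position = binary.dot(weights)
print(position.tolist())  # [6, 10]
```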

Answer 3 (score: 0)

Setup

df = pd.DataFrame({0: {2596: 51},
 1: {2596: 3},
 2: {2596: 258},
 3: {2596: 0},
 4: {2596: 510},
 5: {2596: 221},
 6: {2596: 232},
 7: {2596: 148},
 8: {2596: 6279},
 9: {2596: 24},
 10: {2596: 0},
 11: {2596: 0},
 12: {2596: 0},
 13: {2596: 0},
 14: {2596: 0},
 15: {2596: 1},
 16: {2596: 0},
 17: {2596: 0},
 18: {2596: 0},
 19: {2596: 0},
 20: {2596: 2}})

Solution

#find the index of the column with value 1 within the 10 columns
df.iloc[:,10] = np.argmax(df.iloc[:,10:20].values,axis=1)+1

#select the first 10 columns, the position column and the label column
df.iloc[:,list(range(11))+[20]]

Out[2167]: 
      0   1    2   3    4    5    6    7     8   9   10  20
2596  51   3  258   0  510  221  232  148  6279  24   6   2
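The same `np.argmax` idea can be wrapped in a small helper for a plain NumPy matrix, which is what the question asked about. This is only a sketch: `collapse_one_hot` is a hypothetical name, and the column bounds 11 and 21 assume the 22-value sample row from the question with no index column.

```python
from io import StringIO

import numpy as np


def collapse_one_hot(data, start, stop):
    """Replace columns start..stop-1 with one 1-based position column."""
    block = data[:, start:stop]
    # For a 0/1 block, a row sum of 1 means exactly one 1 in that row.
    if not (block.sum(axis=1) == 1).all():
        raise ValueError("each row must contain exactly one 1 in the block")
    # argmax returns the 0-based column of the 1; +1 makes it 1-based.
    position = block.argmax(axis=1) + 1
    return np.column_stack([data[:, :start], position, data[:, stop:]])


# The sample row from the question, read as if it came from a CSV file.
csv_text = "2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2\n"
data = np.loadtxt(StringIO(csv_text), delimiter=",", dtype=int, ndmin=2)

result = collapse_one_hot(data, 11, 21)
print(result)
```

Raising on rows that are not one-hot is a design choice made here; `argmax` alone would return position 1 for an all-zero row without any warning.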