Question

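I have a dataset in a CSV file with 21 columns. The first 10 columns are numeric and I don't want to change them. The next 10 columns are binary data containing only 1s and 0s; in each row exactly one of them is "1" and the others are "0". The last column is the given label.

The sample data looks like this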

2596,51,3,258,0,510,221,232,148,6279,24(10th column),0,0,0,0,0,1(16th column),0,0,0,0,2(the last column)

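Suppose I load the data into a matrix. Can I keep the first 10 columns and the last column unchanged, and convert the middle 10 columns into a single column? After the conversion, I want the value of that column to be the index of the "1" within the row. For the row above, the desired result is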

2596,51,3,258,0,510,221,232,148,6279,24,6(it's 6 because the "1" is on 6th column of the binary data),2 #12 columns in total

Can I do this with NumPy, scikit-learn, or some other way?

Answer 1

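If you load it into an array arr (in cannot be used as a variable name because it is a reserved word in Python), then this should do it:

out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]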
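A minimal runnable sketch of the NumPy approach above. The array name `arr` is an assumption, and the two sample rows are borrowed from the question and from Answer 2. It relies on every row containing exactly one 1 in the binary block, as the question states.

```python
import numpy as np

# Two sample rows: 11 numeric columns, 10 one-hot columns, 1 label column.
arr = np.array([
    [2596, 51, 3, 258, 0, 510, 221, 232, 148, 6279, 24,
     0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
     0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
])

# np.where on the binary block returns (row_indices, column_indices) of the
# nonzero entries in row-major order. With exactly one 1 per row, the column
# indices line up one-to-one with the rows.
position = np.where(arr[:, 11:-1])[1] + 1  # 1-based position of the "1"

# np.c_ stacks the pieces column-wise; 1-D arrays become single columns.
out = np.c_[arr[:, :11], position, arr[:, -1]]
print(out)
```

For a real file, something like `np.loadtxt(path, delimiter=",", dtype=int)` could produce `arr`, assuming the file has no header row and contains only integers.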

Answer 2

from io import StringIO

import pandas as pd

csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
               "\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")

df = pd.read_csv(csv, header=None)

df = pd.concat(objs=[df[df.columns[:11]],
                     df[df.columns[11:-1]].idxmax(axis=1) - 10,
                     df[df.columns[-1]]], axis=1)

print(df)

Output:

     0   1   2    3   4    5    6    7    8     9   10  0   21
0  2596  51   3  258   0  510  221  232  148  6279  24   6   2
1     1   2   3    4   5    6    7    8    9    10  11   5   1

Answer 3

Data:

In [135]: df
Out[135]:
     0   1   2    3   4    5    6    7    8     9   ...  12  13  14  15  16  17  18  19  20  21
0  2596  51   3  258   0  510  221  232  148  6279  ...   0   0   0   0   1   0   0   0   0   2
1  2596  51   3  258   0  510  221  232  148  6279  ...   0   0   0   0   0   0   0   0   1   2

[2 rows x 22 columns]

Solution:

df = pd.read_csv('/path/to/file.csv', header=None)

In [137]: df.iloc[:, :11] \
            .join(df.iloc[:, 11:21].dot(range(1,11)).to_frame(11)) \
            .join(df.iloc[:, -1])
Out[137]:
     0   1   2    3   4    5    6    7    8     9   10  11  21
0  2596  51   3  258   0  510  221  232  148  6279  24   6   2
1  2596  51   3  258   0  510  221  232  148  6279  24  10   2
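Why the dot product works: each binary row contains a single 1, so multiplying the row by the weights 1 to 10 and summing leaves only the weight at the position of that 1. The sketch below rebuilds the two rows shown under "Data" in memory instead of reading a file, and checks the result against the `idxmax` approach from Answer 2. The names `rows`, `binary`, `by_dot` and `by_idxmax` are illustrative only.

```python
import numpy as np
import pandas as pd

# The two rows from the "Data" listing, built in memory.
rows = [
    [2596, 51, 3, 258, 0, 510, 221, 232, 148, 6279, 24,
     0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2],
    [2596, 51, 3, 258, 0, 510, 221, 232, 148, 6279, 24,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
]
df = pd.DataFrame(rows)

binary = df.iloc[:, 11:21]  # the 10 one-hot columns (labels 11..20)

# Weighted sum: a row with its 1 in the k-th binary column sums to k.
by_dot = binary.dot(np.arange(1, 11))

# Cross-check: idxmax returns the column label of the 1 (11..20),
# so subtracting 10 gives the same 1-based position.
by_idxmax = binary.idxmax(axis=1) - 10

print(by_dot.tolist())
print(by_idxmax.tolist())
```

Both approaches agree only when each row has exactly one 1. With several 1s in a row the dot product would add their positions together, while `idxmax` would report the first one.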

Answer 4

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame({0: {2596: 51},
 1: {2596: 3},
 2: {2596: 258},
 3: {2596: 0},
 4: {2596: 510},
 5: {2596: 221},
 6: {2596: 232},
 7: {2596: 148},
 8: {2596: 6279},
 9: {2596: 24},
 10: {2596: 0},
 11: {2596: 0},
 12: {2596: 0},
 13: {2596: 0},
 14: {2596: 0},
 15: {2596: 1},
 16: {2596: 0},
 17: {2596: 0},
 18: {2596: 0},
 19: {2596: 0},
 20: {2596: 2}})

Solution

#find the index of the column with value 1 within the 10 columns
df.iloc[:,10] = np.argmax(df.iloc[:,10:20].values,axis=1)+1

#select the first 10 columns, the position column and the label column
df.iloc[:,list(range(11))+[20]]

Out[2167]: 
      0   1    2   3    4    5    6    7     8   9   10  20
2596  51   3  258   0  510  221  232  148  6279  24   6   2
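One caveat with `np.argmax`: for a row that contains no 1 at all it returns 0, so adding 1 would silently report position 1. A small guard, sketched here with a hypothetical helper `onehot_position` (not part of NumPy or pandas), can verify the "exactly one 1 per row" assumption from the question before converting. It assumes the block contains only 0s and 1s.

```python
import numpy as np

def onehot_position(block):
    """Return the 1-based position of the single 1 in each row of a 0/1 block.

    Hypothetical helper for illustration; assumes entries are only 0 or 1.
    """
    block = np.asarray(block)
    # Guard: argmax returns 0 for an all-zero row, which would map to 1,
    # and for a row with several 1s it would report only the first.
    if not np.all(block.sum(axis=1) == 1):
        raise ValueError("every row must contain exactly one 1")
    return block.argmax(axis=1) + 1

positions = onehot_position([[0, 0, 1], [1, 0, 0]])
print(positions)
```

With the DataFrame from the setup above, the call would look like `onehot_position(df.iloc[:, 10:20].values)`.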

Multiple binary columns to one column

4 answers: