Converting numeric data to feature vectors

Time: 2017-10-05 11:40:12

Tags: python pandas csv dataframe feature-extraction

I want to convert numeric data into feature vectors with this code:

import numpy as np
import pandas as pd
import csv

def clearRegister():
    clear_register = []
    zero = 0
    for i in range(21):
        clear_register.append(0)
    return clear_register

def header():
    clear_register = []
    name = 'c'
    entry = 1
    for i in range(21):
        clear_register.append(name+str(entry))
        entry += 1
    return clear_register

def convert(filename):
    clear_dataset = []
    clear_dataset.append(header())
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            clear_register = clearRegister()
            clear_register[(int(row["blue1"])-1)] = 1
            clear_register[(int(row["blue2"])-1)] = 1
            clear_register[(int(row["blue3"])-1)] = 1
            clear_register[(int(row["red1"])+9)] = 1
            clear_register[(int(row["red2"])+9)] = 1
            clear_register[(int(row["red3"])+9)] = 1

This is my CSV input:

row blue1 blue2 blue3 red1 red2 red3 lable
0 1 5 4 6 2 8 0
1 2 3 1 9 4 5 1
. . . . . . . .
3000 5 7 4 3 8 10 1

I want the output to look like this (blue values map to c1-c10, red values to c11-c20):

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable
 1  0  0  1  1  0  0  0   0  0   0   1   0   0   0   1   0   1   0   0  0
 1  1  1  0  0  0  0  0   0  0   0   0   0   1   1   0   0   0   1   0  1
 .  .  .  .  .  .  .  .   .  .   .   .   .   .   .   .   .   .   .   .  .
 0  0  0  1  1  0  1  0   0  0   1   0   0   0   0   0   0   1   0   1  1

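c11-c20 are the 'red' representation of c1-c10, and all of them are unique: if c1, c5, and c10 have the value 1, then c11, c15, and c20 cannot have that value.

I tried calling it like this: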

df = convert("dataset.csv")
df1 = pd.DataFrame(df)
print(df1)

I got this result:

Empty DataFrame
Columns: []
Index: []

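Is something wrong with the code, or is something missing?

1 answer:

Answer 0: (score: 1)

Consider a pandas solution instead of manual csv handling: use loc in a loop to create the new c1-c20 columns. The example below demonstrates this with random data:

Data (only for readers of this question; the OP would use the actual csv)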

import numpy as np
import pandas as pd

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 25)

np.random.seed(5005)
df = pd.DataFrame({'row': range(3000),
                   'blue1': [np.random.randint(11) for _ in range(3000)],
                   'blue2': [np.random.randint(11) for _ in range(3000)],
                   'blue3': [np.random.randint(11) for _ in range(3000)],
                   'red1': [np.random.randint(11) for _ in range(3000)],
                   'red2': [np.random.randint(11) for _ in range(3000)],
                   'red3': [np.random.randint(11) for _ in range(3000)],
                   'lable': [0,1]*1500})

print(df.head())
#    blue1  blue2  blue3  lable  red1  red2  red3  row
# 0      4      5      5      0    10     0     8    0
# 1      7      2      2      1     3     8     8    1
# 2      2      4      0      0     8     1     7    2
# 3      4      5      8      1     9     8     1    3
# 4      0      1      5      0     5     6     9    4

Process

for i in range(1,11):    
    df.loc[(df['blue1'] == i) | (df['blue2'] == i) | (df['blue3'] == i), 'c'+str(i)] = 1
    df.loc[(df['red1'] == i) | (df['red2'] == i) | (df['red3'] == i), 'c'+str(i+10)] = 1

# SELECT AND RE-ORDER COLUMNS, FILL IN NANs, CONVERT TO INT TYPE
df = df[['c'+str(i) for i in range(1,21)]+['lable']].fillna(0).astype(int)

print(df.head())    
#    c1  c2  c3  c4  c5  c6  c7  c8  c9  c10  c11  c12  c13  c14  c15  c16  c17  c18  c19  c20  lable
# 0   0   0   0   1   1   0   0   0   0    0    0    0    0    0    0    0    0    1    0    1      0
# 1   0   1   0   0   0   0   1   0   0    0    0    0    1    0    0    0    0    1    0    0      1
# 2   0   1   0   1   0   0   0   0   0    0    1    0    0    0    0    0    1    1    0    0      0
# 3   0   0   0   1   1   0   0   1   0    0    1    0    0    0    0    0    0    1    1    0      1
# 4   1   0   0   0   1   0   0   0   0    0    0    0    0    0    1    1    0    0    1    0      0
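As for why the original call printed an empty DataFrame: convert() as posted never appends clear_register to clear_dataset and has no return statement, so it returns None, and pd.DataFrame(None) is empty. For readers who would rather keep the csv approach, here is a minimal sketch of a repaired convert(). It assumes the file is comma-separated with the header names shown in the question, and that blue and red values lie in 1..10; the N_VALUES constant and the choice to return a DataFrame directly are illustrative, not part of the original code.

```python
import csv

import pandas as pd

N_VALUES = 10  # assumption: blue and red values each lie in 1..10


def convert(filename):
    """Build c1..c20 indicator columns plus the label for each CSV record."""
    columns = ["c" + str(i) for i in range(1, 2 * N_VALUES + 1)] + ["lable"]
    rows = []
    with open(filename, newline="") as csvfile:
        reader = csv.DictReader(csvfile)  # assumes a comma-separated file
        for row in reader:
            register = [0] * (2 * N_VALUES)
            for key in ("blue1", "blue2", "blue3"):
                # blue value v (1..10) sets column c<v>
                register[int(row[key]) - 1] = 1
            for key in ("red1", "red2", "red3"):
                # red value v (1..10) sets column c<v + 10>
                register[int(row[key]) - 1 + N_VALUES] = 1
            register.append(int(row["lable"]))
            rows.append(register)  # this step was missing in the original
    # the original function had no return statement
    return pd.DataFrame(rows, columns=columns)
```

With this version, convert("dataset.csv") already returns a DataFrame, so the extra pd.DataFrame(...) wrapping from the question is not needed.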