I want to use this code to normalize numeric data into feature vectors:
import numpy as np
import pandas as pd
import csv

def clearRegister():
    clear_register = []
    zero = 0
    for i in range(21):
        clear_register.append(0)
    return clear_register

def header():
    clear_register = []
    name = 'c'
    entry = 1
    for i in range(21):
        clear_register.append(name+str(entry))
        entry += 1
    return clear_register

def convert(filename):
    clear_dataset = []
    clear_dataset.append(header())
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            clear_register = clearRegister()
            clear_register[(int(row["blue1"])-1)] = 1
            clear_register[(int(row["blue2"])-1)] = 1
            clear_register[(int(row["blue3"])-1)] = 1
            clear_register[(int(row["red1"])+9)] = 1
            clear_register[(int(row["red2"])+9)] = 1
            clear_register[(int(row["red3"])+9)] = 1
This is my csvfile input:
row blue1 blue2 blue3 red1 red2 red3 lable
0 1 5 4 6 2 8 0
1 2 3 1 9 4 5 1
. . . . . . . .
3000 5 7 4 3 8 10 1
I want the output to look like this (blue is c1-c10, red is c11-c20):
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable
1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0
1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1
. . . . . . . . . . . . . . . . . . . . .
0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1
c11-c20 are the 'red' representation of c1-c10, and all of them are unique: if c1, c5, and c10 have the value 1, then c11, c15, and c20 cannot have that value.
I tried calling it like this:
df = convert("dataset.csv")
df1 = pd.DataFrame(df)
print(df1)
And I got this result:
Empty DataFrame
Columns: []
Index: []
Is something wrong or missing in the code?
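Answer 0 (score: 1)
Consider a pandas solution instead of csv handling, using loc to iteratively create the new c1-c20 columns. Below is a demonstration with random data:
Data (only for readers of this question, since the OP uses an actual csv)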
import numpy as np
import pandas as pd
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 25)
np.random.seed(5005)
df = pd.DataFrame({'row': range(3000),
                   'blue1': [np.random.randint(11) for _ in range(3000)],
                   'blue2': [np.random.randint(11) for _ in range(3000)],
                   'blue3': [np.random.randint(11) for _ in range(3000)],
                   'red1': [np.random.randint(11) for _ in range(3000)],
                   'red2': [np.random.randint(11) for _ in range(3000)],
                   'red3': [np.random.randint(11) for _ in range(3000)],
                   'lable': [0,1]*1500})
print(df.head())
# blue1 blue2 blue3 lable red1 red2 red3 row
# 0 4 5 5 0 10 0 8 0
# 1 7 2 2 1 3 8 8 1
# 2 2 4 0 0 8 1 7 2
# 3 4 5 8 1 9 8 1 3
# 4 0 1 5 0 5 6 9 4
Process
for i in range(1,11):
    df.loc[(df['blue1'] == i) | (df['blue2'] == i) | (df['blue3'] == i), 'c'+str(i)] = 1
    df.loc[(df['red1'] == i) | (df['red2'] == i) | (df['red3'] == i), 'c'+str(i+10)] = 1

# SELECT AND RE-ORDER COLUMNS, FILL IN NANs, CONVERT TO INT TYPE
df = df[['c'+str(i) for i in range(1,21)]+['lable']].fillna(0).astype(int)
print(df.head())
# c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable
# 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
# 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1
# 2 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0
# 3 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1
# 4 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0
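
As for why the original call printed an empty DataFrame: the convert function as posted never appends each clear_register to clear_dataset and has no return statement, so it returns None, and pd.DataFrame(None) is empty. If the csv-based approach is preferred, a minimal sketch of a version that does return its rows might look like the following. It assumes a comma-delimited file with the headers shown in the question and values in 1-10. Taking an open file object instead of a filename, the N_VALUES constant, and the in-memory sample are choices made here for illustration, not part of the original code.

```python
import csv
import io

import pandas as pd

N_VALUES = 10  # assumption: blue and red values each range over 1..10


def convert(csvfile):
    """Read rows from an open CSV file object and return a one-hot DataFrame."""
    columns = ['c' + str(i) for i in range(1, 2 * N_VALUES + 1)] + ['lable']
    rows = []
    for row in csv.DictReader(csvfile):
        register = [0] * (2 * N_VALUES + 1)
        for key in ('blue1', 'blue2', 'blue3'):
            register[int(row[key]) - 1] = 1             # blue v -> column c(v)
        for key in ('red1', 'red2', 'red3'):
            register[int(row[key]) + N_VALUES - 1] = 1  # red v -> column c(v+10)
        register[-1] = int(row['lable'])                # carry the label over
        rows.append(register)                           # the step missing from the original
    return pd.DataFrame(rows, columns=columns)          # the original had no return


# In-memory stand-in for dataset.csv, using the first two rows from the question
sample = io.StringIO(
    "row,blue1,blue2,blue3,red1,red2,red3,lable\n"
    "0,1,5,4,6,2,8,0\n"
    "1,2,3,1,9,4,5,1\n"
)
df = convert(sample)
print(df)
```

With a real file, the call would be wrapped as `with open("dataset.csv", newline='') as f: df = convert(f)`.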