I'm currently working on a dataset (Adult Income), which looks like this:
age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
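As you may notice, there are some categorical attributes (workclass, education, marital-status, etc.) and some continuous ones (age, fnlwgt, educational-num, etc.).
So far, I have managed to create categories for the continuous values using pandas' qcut and cut, and to handle the categorical values by creating one attribute per class. However, I don't really know how to read the original dataset and write my binarized dataset at the same time. I would like my binarized dataset to look like this: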
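To make the two steps described above concrete, here is a minimal sketch on a tiny in-memory frame. It assumes the bin edges 16, 26, 33, 41, 51, 90 taken from the desired header, and uses `pd.cut` with explicit edges (`pd.qcut` would instead derive the edges from quantiles). The names `age_edges`, `age_labels` and `binarized` are illustrative only.

```python
import pandas as pd

# A few rows shaped like the sample above (only three columns kept for brevity).
df = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", "Private", "Local-gov", "Private"],
    "income": ["<=50K", "<=50K", ">50K", ">50K"],
})

# Continuous column: explicit, right-closed bins (low, high].
# The edges are an assumption read off the desired header.
age_edges = [16, 26, 33, 41, 51, 90]
age_labels = [f"age_between_{lo}_{hi}" for lo, hi in zip(age_edges[:-1], age_edges[1:])]
age_bins = pd.cut(df["age"], bins=age_edges, labels=age_labels)

# One 0/1 column per bin; unobserved bins still get an all-zero column
# because age_bins is categorical.
age_onehot = pd.get_dummies(age_bins, dtype=int)
age_onehot.columns = age_labels  # plain string column names

# Categorical column: one 0/1 column per distinct value seen in the data.
workclass_onehot = pd.get_dummies(df["workclass"], prefix="workclass", dtype=int)

# Put the pieces side by side, keeping the target column unchanged.
binarized = pd.concat([age_onehot, workclass_onehot, df["income"]], axis=1)
```

Note that `get_dummies` on a plain string column only creates columns for values that actually occur, so classes absent from the data (for example `workclass_Never-worked` in this tiny sample) would not appear unless the column is first converted to a categorical with the full list of classes.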
age_between_16_26,age_between_26_33,age_between_33_41,age_between_41_51,age_between_51_90,workclass_Private,workclass_Self-emp-not-inc,workclass_Self-emp-inc,workclass_Federal-gov,workclass_Local-gov,workclass_State-gov,workclass_Without-pay,workclass_Never-worked,fnlwgt_between_12284_79714,fnlwgt_between_79714_117550,fnlwgt_between_117550_151626,fnlwgt_between_151626_178144,fnlwgt_between_178144_200967,fnlwgt_between_200967_237642,fnlwgt_between_237642_308081,fnlwgt_between_308081_1490400,education_Bachelors,education_Some-college,education_11th,education_HS-grad,education_Prof-school,education_Assoc-acdm,education_Assoc-voc,education_9th,education_7th-8th,education_12th,education_Masters,education_1st-4th,education_10th,education_Doctorate,education_5th-6th,education_Preschool,educational-num_between_1_9,educational-num_between_9_10,educational-num_between_10_12,educational-num_between_12_16,marital-status_Married-civ-spouse,marital-status_Divorced,marital-status_Never-married,marital-status_Separated,marital-status_Widowed,marital-status_Married-spouse-absent,marital-status_Married-AF-spouse,occupation_Tech-support,occupation_Craft-repair,occupation_Other-service,occupation_Sales,occupation_Exec-managerial,occupation_Prof-specialty,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Adm-clerical,occupation_Farming-fishing,occupation_Transport-moving,occupation_Priv-house-serv,occupation_Protective-serv,occupation_Armed-Forces,relationship_Wife,relationship_Own-child,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Unmarried,race_White,race_Asian-Pac-Islander,race_Amer-Indian-Eskimo,race_Other,race_Black,gender,capital-gain_between_0_24999,capital-gain_between_24999_49999,capital-gain_between_49999_74999,capital-gain_between_74999_99999,capital-loss_between_0_1089,capital-loss_between_1089_2178,capital-loss_between_2178_3267,capital-loss_between_3267_4356,hours-per-week_between_0_40,hours-per-week_between_40_99,native-country_United-States,native-country_Cambodia,native-country_England,native-country_Puerto-Rico,native-country_Canada,native-country_Germany,native-country_Outlying-US(Guam-USVI-etc),native-country_India,native-country_Japan,native-country_Greece,native-country_South,native-country_China,native-country_Cuba,native-country_Iran,native-country_Honduras,native-country_Philippines,native-country_Italy,native-country_Poland,native-country_Jamaica,native-country_Vietnam,native-country_Mexico,native-country_Portugal,native-country_Ireland,native-country_France,native-country_Dominican-Republic,native-country_Laos,native-country_Ecuador,native-country_Taiwan,native-country_Haiti,native-country_Columbia,native-country_Hungary,native-country_Guatemala,native-country_Nicaragua,native-country_Scotland,native-country_Thailand,native-country_Yugoslavia,native-country_El-Salvador,native-country_Trinadad&Tobago,native-country_Peru,native-country_Hong,native-country_Holand-Netherlands,income
1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1 ... (I think you get it)
For now, I have tried to do this manually:
import csv

# path_reading / path_writing: paths of the original and the binarized CSV files
with open(path_reading, "r", newline="") as read, open(path_writing, "w", newline="") as write:
    reader = csv.reader(read)
    for row in reader:
        if row[0] == "age":  # skip the header row
            continue
        age = int(row[0])
        line = ""
        # one 0/1 flag per age bin
        line += "1," if 17 <= age <= 26 else "0,"
        line += "1," if 26 < age <= 33 else "0,"
        line += "1," if 33 < age <= 41 else "0,"
        line += "1," if 41 < age <= 51 else "0,"
        line += "1" if 51 < age <= 90 else "0"
        line += "\n"
        write.write(line)
        # And things go on ...
But I would like to know whether there is a faster way to do this.
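For reference, one less repetitive shape for the manual loop, given only as a sketch: the bins live in a table instead of one `if` per bin, and rows are streamed from a `csv.DictReader` to a `csv.writer`, so reading and writing happen in the same pass. The names `AGE_BINS`, `binarize_age` and `binarize` are hypothetical, only the `age` column is handled, and the intervals assume the same (low, high] conditions as the loop above.

```python
import csv
import io

# (low, high] intervals; for integer ages, 16 < age is the same as age >= 17.
AGE_BINS = [(16, 26), (26, 33), (33, 41), (41, 51), (51, 90)]


def binarize_age(age):
    """Return one 0/1 flag per age bin (low < age <= high)."""
    return [1 if low < age <= high else 0 for low, high in AGE_BINS]


def binarize(reader, writer):
    """Write a header, then one binarized row per input row."""
    writer.writerow([f"age_between_{low}_{high}" for low, high in AGE_BINS])
    for row in reader:
        # DictReader consumes the header itself, so no 'row[0] == "age"' check.
        writer.writerow(binarize_age(int(row["age"])))


# Small in-memory demonstration; with real files, pass the objects
# returned by open(..., newline="") instead of StringIO.
source = io.StringIO("age,workclass\n25,Private\n38,Private\n")
target = io.StringIO()
binarize(csv.DictReader(source), csv.writer(target, lineterminator="\n"))
```

Other columns could follow the same pattern, with one small function per attribute whose flags are concatenated into the output row.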