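I have about 100 samples, each with 35 features. I need to find the best features from this set. The features are clinical and genomic details of patients with a particular type of cancer. I need to find which combination of features gives the highest accuracy (ACC).
Here is my code: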
import sys
import csv
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
with open("feature matrix", "r") as f:
    for line in f:
        entries = line.split(',')
        for i in entries:
            if i.decode('utf-8')[0]==u'-':
                print 0-float(i.decode('utf-8')[1:])
            X = i.decode('utf-8')
y='survival'
X.shape
y.shape
X_train, X_test, y_train, y_test= train_test_split(X, y,
                                                   stratify=y,
                                                   test_size=0.3,
                                                   random_state=1)
knn = KNeighborsClassifier(n_neighbors=2)
sfs1 = SFS(estimator=knn,
           k_features=(3,35),
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
pipe = make_pipeline(StandardScaler(), sfs1)
pipe.fit(X_train,y_train)
print('best combination (ACC: %.3f): %s\n' %
      (sfs1.k_score_,sfs1.k_feature_idx_))
When I run the code, I get the following error:
Traceback (most recent call last):
  File "sample_test.py", line 21, in <module>
    X.shape
AttributeError: 'unicode' object has no attribute 'shape'
I don't yet have enough experience to know what the next troubleshooting step should be. Is there a better way to write this code?
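
The error arises because X ends up holding a single unicode string (the last entry read from the file) and y is the literal string 'survival', so neither has a .shape attribute. One possible way to get numeric arrays instead is to let pandas parse the file. The sketch below is written for Python 3 and rests on assumptions that the question does not confirm: that the file is comma-separated with a header row, and that the label lives in a column named survival. The sample data, the other column names, and the helper load_features are hypothetical and only serve to illustrate the idea.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a header row, numeric feature
# columns, and a binary "survival" label column (all assumed).
SAMPLE_CSV = """age,gene_a,gene_b,survival
61,-0.52,1.30,1
47,0.88,-2.10,0
55,-1.75,0.04,1
70,0.19,0.67,0
"""


def load_features(source, label_column="survival"):
    """Return (X, y) as NumPy arrays from a CSV path or file-like object."""
    df = pd.read_csv(source)
    # Labels: one value per sample, taken from the named column.
    y = df[label_column].to_numpy()
    # Features: every remaining column, converted to floats.
    # Negative values such as "-0.52" are parsed without special handling.
    X = df.drop(columns=[label_column]).to_numpy(dtype=float)
    return X, y


X, y = load_features(io.StringIO(SAMPLE_CSV))
print(X.shape)  # (4, 3)
print(y.shape)  # (4,)
```

With X and y as arrays of matching length, the split and the feature selector have something they can work with. Note that in current scikit-learn releases train_test_split is imported from sklearn.model_selection rather than sklearn.cross_validation.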