I am using sklearn's SVC to classify matrices. The data are 95 correlation matrices, computed from the MRI scans of schizophrenia patients (50 matrices) and healthy controls (45 matrices). The matrices are fairly large (264×264), so I was not expecting perfect results, but 0% accuracy seems far too low.
Data: 95 matrices of size 264×264, with values in [-1, 1].
Here is the code:
## Data
#control_matrices: list of 45 matrices
#patient_matrices: list of 50 matrices
n_training = 25 #Number of control matrices used to train the SVC (25 control + 25 patient)
indices = np.triu_indices(264,1) #Since the matrices are symmetric, I only take the upper triangle
perm_control = np.random.permutation(45) #Doing a permutation to take random matrices for training
contr_matrices = control_matrices[perm_control] #control_matrices is an array of matrices (see the full code below)
perm_patient = np.random.permutation(50) #Same with the patient matrices
pat_matrices = patient_matrices[perm_patient]
x_control = [m[indices] for m in contr_matrices[:n_training]] #Data for training
x_patient = [m[indices] for m in pat_matrices[:n_training]]
test_control = [m[indices] for m in contr_matrices[n_training:]] #Data for test once the SVM is trained
test_patient = [m[indices] for m in pat_matrices[n_training:]]
X = np.concatenate((x_control, x_patient))
Y = np.asarray( n_training*[0] + n_training*[1] ) #Control: 0 - Patient: 1
perm = np.random.permutation(50)
X = X[perm]
Y = Y[perm]
## Training
clf = SVC()
clf.fit(X,Y)
Since the dimensionality of the data is huge compared to the number of matrices, I expected mediocre results (a little better than 50%):
clf.score(np.concatenate((test_control, test_patient)), 20*[0]+25*[1])
>>> 0.0
The same thing happens every time I run the code (hence with different permutations) and for every n_training from 10 to 45. The SVC does remember the matrices it was trained on, though: clf.score(X,Y) is 1.0.
The same happens with clf = LinearSVC() and with clf = LogisticRegression().
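To see what the classifier actually predicts (not just its score), the raw predictions can be printed; a small diagnostic sketch reusing the variables above (note that 0% accuracy on a binary test set means every single prediction is flipped, not merely that one class dominates):
X_test = np.concatenate((test_control, test_patient))
Y_test = np.asarray(20*[0] + 25*[1]) #20 remaining controls, 25 remaining patients
print(clf.predict(X_test)) #Raw predictions of the trained classifier
print(Y_test)              #True labels, for comparison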
I also tried the following, with exactly the same result:
from sklearn.cross_validation import StratifiedKFold, cross_val_score
from nilearn import connectome
connectivity_coefs = connectome.sym_to_vec(matrices)
# This turns each symmetric matrix into a 1D vector
Y = 45*[0] + 50*[1]
cv = StratifiedKFold(Y, n_folds=3, shuffle=True)
svc = LinearSVC()
cv_scores = cross_val_score(svc, connectivity_coefs, Y, cv=cv, scoring='accuracy')
print('Score: %1.2f +- %1.2f' % (cv_scores.mean(), cv_scores.std()))
>>> Score: 0.00 +- 0.00
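To inspect what happens inside each fold, the same cv object can be looped over by hand; a quick diagnostic sketch reusing the variables above (in this sklearn version the StratifiedKFold object is directly iterable):
labels = np.asarray(Y) #Array version of the labels, so fancy indexing works
for train_idx, test_idx in cv: #Each split yields train/test index arrays
    svc.fit(connectivity_coefs[train_idx], labels[train_idx])
    print(svc.predict(connectivity_coefs[test_idx])) #Predictions for this fold
    print(labels[test_idx])                          #True labels, for comparison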
I also tried much simpler data: matrices filled with 0 for the controls and matrices filled with 1 for the patients. The SVC handles those perfectly well, so at first I suspected the problem was the shape of my data (high dimensionality, very few samples). But with matrices = np.random.rand(95,264,264), I get Score: 0.58 +- 0.03.
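For reference, these sanity checks look like this (a sketch of the inputs I plug into the same pipeline; the constant matrices stand in for the real data):
import numpy as np
# Trivially separable case: constant matrices per class -> SVC scores ~1.0
control_matrices = np.zeros((45, 264, 264)) #Stands in for the real control matrices
patient_matrices = np.ones((50, 264, 264))  #Stands in for the real patient matrices
# Pure-noise case: no class signal at all -> accuracy stays near chance (~0.5)
matrices = np.random.rand(95, 264, 264)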
Using the full matrices instead of just the upper triangle, I still get 0% accuracy.
I really do not understand what is going on here.
Windows-8-6.2.9200
Python 3.4.1 |Continuum Analytics, Inc.| (default, May 19 2014, 13:02:30) [MSC v.1600 64 bit (AMD64)]
NumPy 1.9.1
SciPy 0.15.1
Scikit-Learn 0.15.2
Here is the full code to obtain the matrices I used (MRI scans from an open dataset):
from nilearn import datasets
from nilearn import input_data
from nilearn.connectome import ConnectivityMeasure
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn.cross_validation import StratifiedKFold, cross_val_score
from nilearn import connectome
## Atlas for the parcellation and Dataset
power = datasets.fetch_coords_power_2011()
coords = np.vstack((power.rois['x'], power.rois['y'], power.rois['z'])).T
datas = datasets.fetch_cobre(n_subjects=None, verbose=0)
spheres_masker = input_data.NiftiSpheresMasker(
    seeds=coords, smoothing_fwhm=4, radius=5.,
    detrend=True, standardize=True,
    high_pass=0.01, t_r=2, verbose=0)
## Extracting the usable MRI time series
list_time_series = []
i = 0
for fmri_filenames, confounds_file in zip(datas.func, datas.confounds): #Might take a few minutes
    print("Subject %s" % i)
    if i != 38 and i != 41: #Subjects removed from the study
        conf = np.genfromtxt(confounds_file)
        conf = np.delete(conf, obj = 16, axis = 1) #Remove Global Signal
        conf = np.delete(conf, obj = 0, axis = 0) #Remove labels
        scrub = [j for j in range(150) if conf[j,7]==1] #Indices of the frames flagged for scrubbing
        conf = np.delete(conf, obj = 7, axis = 1) #Remove Scrub
        if len(scrub) < 90: #Keep subjects with at least 60 non-scrubbed frames
            time_series = spheres_masker.fit_transform(fmri_filenames, confounds=conf)
            time_series = np.delete(time_series, obj = scrub, axis = 0) #Remove scrubbed frames
            list_time_series.append(time_series)
        else:
            list_time_series.append([])
    else:
        list_time_series.append([])
    i+=1
## Computing correlation matrices
N = len(datas.phenotypic)
control_subjects = []
patient_subjects = []
for i in range(N):
    t = list_time_series[i]
    if type(t) != list: #Kept subjects are arrays; discarded ones are empty lists
        subject = datas.phenotypic[i]
        if str(subject[4])=='b\'Control\'':
            control_subjects.append(t)
        else:
            patient_subjects.append(t)
control_subjects = np.asarray(control_subjects)
patient_subjects = np.asarray(patient_subjects)
connect_measure = ConnectivityMeasure(kind='tangent')
control_matrices=connect_measure.fit_transform(control_subjects)
patient_matrices=connect_measure.fit_transform(patient_subjects)
matrices = np.concatenate((control_matrices, patient_matrices))
Or you can download them here.
Thanks for your help!
Answer 0 (score: 1)
You should map the output labels to numbers (e.g. ["Control"] to [0] and ["Patient"] to [1]) instead of using Control and Patient as output labels, because ML algorithms only work with real numbers.
So
Y = np.asarray( n_training*["Control"] + n_training*["Patient"] )
should be
Y = np.asarray( n_training*[0] + n_training*[1] )
and
clf.score(np.concatenate((test_control, test_patient)), 20*['Control']+25*['Patient'])
should be
clf.score(np.concatenate((test_control, test_patient)), np.asarray( 20*[0] + 25*[1] ))
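If you prefer to keep the string labels around, sklearn's LabelEncoder can build this mapping for you; a minimal sketch, independent of the code above:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(20*['Control'] + 25*['Patient']) #Encodes the strings as integers
print(le.classes_) #['Control' 'Patient'], i.e. Control -> 0, Patient -> 1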