Difference in performance between a Numpy split and cross-validation splits

Date: 2019-04-10 16:43:43

Tags: python machine-learning scikit-learn deep-learning cross-validation

I am doing binary text classification using deep learning. When I split the data into train and test sets 80-20 with a plain Numpy split, the model's performance is about 85%-90% (averaged over 20 runs of the same code).

When I run the same model with a manually implemented 10-fold cross-validation, performance drops to 65%-70%. In every fold, accuracy, recall, precision, and F1-score are all around 65%-70%.

Is there a significant difference between the split performed by Numpy and the splits produced manually in cross-validation?

Numpy Split:

# np.split takes contiguous, unshuffled slices: first 80% train, next 10% validation, last 10% test
train_NT, validate_NT, test_NT = np.split(NT, [int(.8*len(NT)), int(.9*len(NT))])
train_T, validate_T, test_T = np.split(T, [int(.8*len(T)), int(.9*len(T))])
# .append here is the pandas DataFrame method, so NT and T are DataFrames
train = train_NT.append(train_T)
valid = validate_NT.append(validate_T)
test = test_NT.append(test_T)
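For comparison, a split that samples randomly rather than slicing the original order can be sketched as below. This is an illustrative sketch only, not the code used above; the function name `shuffled_split` and the toy `np.arange` data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffled_split(arr):
    # Shuffle first so each subset is a random sample of the data,
    # not a contiguous slice of the original ordering.
    arr = np.asarray(arr)
    arr = arr[rng.permutation(len(arr))]
    # Same 80/10/10 boundaries as the split in the question.
    return np.split(arr, [int(.8 * len(arr)), int(.9 * len(arr))])

train, valid, test = shuffled_split(np.arange(100))
print(len(train), len(valid), len(test))  # 80 10 10
```

Together the three subsets still cover the full dataset exactly once; only the assignment of rows to subsets is randomized.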

Manual Cross Validation:

from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

def crossvalidation(obj, X, Y, folds):
    acc, precision, recall, f1score = [], [], [], []
    Xarr = np.asarray(X)
    Yarr = np.asarray(Y)
    if len(Xarr) != len(Yarr):
        raise ValueError('length of X and Y is not equal')
    for i in range(folds):
        print('cross fold validation fold {}'.format(i + 1))
        # fold i tests on the contiguous block [lo:hi] and trains on the rest
        lo = int(len(Xarr) * i / folds)
        hi = int(len(Xarr) * (i + 1) / folds)
        X_test, Y_test = Xarr[lo:hi], Yarr[lo:hi]
        X_train = np.concatenate((Xarr[:lo], Xarr[hi:]), axis=0)
        Y_train = np.concatenate((Yarr[:lo], Yarr[hi:]), axis=0)
        # obj is the model-name prefix, e.g. obj='cnn' calls cnnmodel(...)
        prediction = list(eval('{}model(X_train, X_test, Y_train)'.format(obj)))
        correct = sum(1 for p, y in zip(prediction, list(Y_test)) if p == y)
        acc.append(float(correct) / len(prediction))
        precision.append(precision_score(y_pred=prediction, y_true=list(Y_test), average='weighted'))
        recall.append(recall_score(y_pred=prediction, y_true=list(Y_test), average='weighted'))
        f1score.append(f1_score(y_pred=prediction, y_true=list(Y_test), average='weighted'))
    ac = sum(acc) / float(len(acc))
    pre = sum(precision) / float(len(precision))
    rec = sum(recall) / float(len(recall))
    f1 = sum(f1score) / float(len(f1score))
    return (ac, pre, rec, f1)
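For reference, the same fold structure can be produced with scikit-learn's `KFold`, which also exposes a `shuffle` flag. This is a minimal sketch on toy data (the arrays `X` and `Y` below are hypothetical, not the question's dataset):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(-1, 1)   # 40 toy samples, one feature each
Y = np.array([0, 1] * 20)          # alternating binary labels

# shuffle=True draws each fold at random from the whole dataset,
# instead of taking contiguous blocks as the manual loop above does.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    fold_sizes.append((len(train_idx), len(test_idx)))
    print('fold {}: train={} test={}'.format(fold, len(train_idx), len(test_idx)))
```

With 40 samples and 10 splits, every fold trains on 36 samples and tests on the remaining 4.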

0 Answers:

No answers yet