train_test_split does not split the data

Time: 2018-07-01 17:58:13

Tags: scikit-learn

I have a DataFrame with 14 columns in total; the last column is the target label, an integer that is either 0 or 1.

I have defined:

  1. X = df.iloc[:, 0:13] (the feature values)
  2. y = df.iloc[:, -1] (the corresponding labels)

Both have the expected length: X is a DataFrame of 13 columns with shape (159880, 13), and y is array-like with shape (159880,).
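As a quick way to check the slicing described above, here is a minimal sketch on a small hypothetical frame (the column names `f0` to `f12` and `label` are made up for illustration and are not the real dataset). It shows that `iloc` slices exclude the stop index, so `0:13` selects 13 columns while `1:13` selects only 12.

```python
import numpy as np
import pandas as pd

# Hypothetical 14-column frame: 13 feature columns plus a 0/1 label in the last column.
columns = [f"f{i}" for i in range(13)] + ["label"]
df = pd.DataFrame(np.zeros((5, 14), dtype=int), columns=columns)

X = df.iloc[:, 0:13]  # columns 0..12, i.e. 13 feature columns
y = df.iloc[:, -1]    # last column, returned as a Series

print(X.shape)                 # (5, 13)
print(y.shape)                 # (5,)
print(df.iloc[:, 1:13].shape)  # (5, 12): the stop index is excluded
```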

But when I run train_test_split on X and y, the function does not behave as expected.

Here is the simple code:

X_train,y_train,X_test,y_test = train_test_split(X,y,random_state = 0)

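After this split, X_train and X_test both have shape (119910, 13). y_train has shape (39970, 13) and y_test has shape (39970,).

This is strange, and the result stays the same even after I set the test_size parameter.

Please advise what might be going wrong.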

import pandas as pd

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from adspy_shared_utilities import plot_feature_importances
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def model():
    df = pd.read_csv('train.csv', encoding = 'ISO-8859-1')
    df = df[np.isfinite(df['compliance'])]
    df = df.fillna(0)
    df['compliance'] = df['compliance'].astype('int')
    df = df.drop(['grafitti_status', 'violation_street_number','violation_street_name','violator_name',
                  'inspector_name','mailing_address_str_name','mailing_address_str_number','payment_status',
                  'compliance_detail', 'collection_status','payment_date','disposition','violation_description',
                  'hearing_date','ticket_issued_date','mailing_address_str_name','city','state','country',
                  'violation_street_name','agency_name','violation_code'], axis=1)
    df['violation_zip_code'] = df['violation_zip_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['zip_code'] = df['zip_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['non_us_str_code'] = df['non_us_str_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['violation_zip_code'] = pd.to_numeric(df['violation_zip_code'], errors='coerce')
    df['zip_code'] = pd.to_numeric(df['zip_code'], errors='coerce')
    df['non_us_str_code'] = pd.to_numeric(df['non_us_str_code'], errors='coerce')
    #df.violation_zip_code = df.violation_zip_code.replace('-','', inplace=True)
    df['violation_zip_code'] = np.nan_to_num(df['violation_zip_code'])
    df['zip_code'] = np.nan_to_num(df['zip_code'])
    df['non_us_str_code'] = np.nan_to_num(df['non_us_str_code'])
    X = df.iloc[:,0:13]
    y = df.iloc[:,-1]
    X_train, y_train, X_test, y_test = train_test_split(X, y, random_state = 0)
    print(y_train.shape)

2 answers:

Answer 0: (score: 2)

You mixed up the order of the values returned by train_test_split. It should be

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0)
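To illustrate why the order matters: train_test_split returns the train and test parts of each input in turn, so for inputs (X, y) the result is X_train, X_test, y_train, y_test. The sketch below uses zero-filled synthetic arrays with the shapes reported in the question (an assumption standing in for the real train.csv data) and the default test size of 25%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the question's shapes (assumed, not the real CSV).
X = np.zeros((159880, 13))
y = np.zeros(159880, dtype=int)

# Correct order: both parts of X first, then both parts of y.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)  # (119910, 13) (39970, 13)
print(y_train.shape, y_test.shape)  # (119910,) (39970,)

# test_size changes how many rows land in each part, not the return order.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
print(X_tr.shape, X_te.shape)  # (79940, 13) (79940, 13)
```

With the unpacking from the question, the name y_train ends up bound to the test part of X, which is why it showed 13 columns, and changing test_size could not fix that.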

Answer 1: (score: 0)

if args.mode == "train":

    # Load Data
    data, labels = load_dataset('C:/Users/PC/Desktop/train/k')

    # Train ML models
    knn(data, labels,'C:/Users/PC/Desktop/train/knn.pkl' )