ValueError: Found input variables with inconsistent numbers of samples: [7111, 1778]

Time: 2018-05-13 00:44:58

Tags: python-3.x

I also tried reshaping X to (8889, 17) and y to (8889, 1), but that did not help at all:

import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors, model_selection

songs_dataset = pd.read_json('MasterSongList.json')

songs_dataset.loc[:,'genres'] = songs_dataset['genres'].apply(''.join)
def consolidateGenre(genre):
    if len(genre)>0:
        return genre.split(':')[0]
    else: return genre

songs_dataset.loc[:, 'genres'] = songs_dataset['genres'].apply(consolidateGenre)

audio_feature_list = [audio_feature for audio_feature in songs_dataset['audio_features']]
audio_features_headers = ['key','energy','liveliness','tempo','speechiness','acousticness','instrumentalness','time_signature'
                         ,'duration','loudness','valence','danceability','mode','time_signature_confidence','tempo_confidence'
                         ,'key_confidence','mode_confidence']
audio_features = pd.DataFrame(audio_feature_list, columns=audio_features_headers)
audio_features.loc[:,].dropna(axis=0,how='all',inplace=True)
audio_features['genres'] = songs_dataset['genres']

rock_rap = audio_features.loc[(audio_features['genres'] == 'rock') | (audio_features['genres'] == 'rap')]
rock_rap.reset_index(drop=True)

label_genres = np.array(rock_rap['genres']).reshape((len(rock_rap),1))
final_features = rock_rap.drop('genres',axis = 1).astype(float)
final_features['speechiness'].fillna(final_features['speechiness'].mean(),inplace=True)

knn = neighbors.KNeighborsClassifier(n_neighbors = 3)
standard_scaler = preprocessing.StandardScaler()
final_features = standard_scaler.fit_transform(final_features)

X_train, y_train, X_test, y_test = cross_validation.train_test_split(final_features,label_genres,test_size=0.2)

knn.fit(X_train,y_train)

ValueError: Found input variables with inconsistent numbers of samples: [7111, 1778]

1 answer:

Answer 0: (score: 1)

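Your problem is that you unpack the results of train_test_split in the wrong order. The function returns the train and test parts of the first array, then the train and test parts of the second array. With your assignment, the variable you call y_train actually holds the test features, so knn.fit ends up being called on the training features and the test features rather than on the training features and training labels as you intended. Use this instead: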

X_train, X_test, y_train, y_test = cross_validation.train_test_split(final_features,label_genres,test_size=0.2)
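
To make the return order concrete, here is a minimal sketch on synthetic data. The array names `features` and `labels` and the random values are placeholders, not the asker's dataset; only the shapes (8889 rows, 17 columns) come from the question. It imports from sklearn.model_selection, because the sklearn.cross_validation module was removed in scikit-learn 0.20 and train_test_split now lives there.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the question's data (assumption: only the shapes
# match the question; the values are random).
rng = np.random.default_rng(0)
features = rng.normal(size=(8889, 17))
labels = rng.choice(["rock", "rap"], size=8889)

# train_test_split returns a (train, test) pair for each input array, in the
# order the arrays were passed: features first, then labels.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Features and labels of the same split must agree on their first dimension.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Printing the shapes right after the split is a cheap check: if the first dimensions of X_train and y_train differ, the unpacking order is the first thing to look at.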

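Incidentally, the sample counts in the error message should have given you a hint: 7111 is almost exactly four times 1778 (0.8 / 0.2 = 4), which is the size ratio of the training set to the test set.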
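
The arithmetic behind that hint can be reproduced in a few lines. This is only a sketch: it assumes the test count is the fractional size rounded up and the training set takes the remainder, which is consistent with the two numbers in the error message.

```python
import math

n_samples = 8889   # number of rows reported in the question
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)            # the two numbers in the error message
print(round(n_train / n_test, 2)) # close to 0.8 / 0.2 = 4
```

Seeing both numbers come out of a single 80/20 split of the same 8889 rows confirms that the two arrays passed to fit were the train and test parts of one array, not features and labels.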