I also tried reshaping X (8889, 17) and y (8889, 1), but it did not help at all:
import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors, model_selection
songs_dataset = pd.read_json('MasterSongList.json')
songs_dataset.loc[:,'genres'] = songs_dataset['genres'].apply(''.join)
def consolidateGenre(genre):
    if len(genre) > 0:
        return genre.split(':')[0]
    else:
        return genre
songs_dataset.loc[:, 'genres'] = songs_dataset['genres'].apply(consolidateGenre)
audio_feature_list = [audio_feature for audio_feature in songs_dataset['audio_features']]
audio_features_headers = ['key','energy','liveliness','tempo','speechiness','acousticness','instrumentalness','time_signature'
,'duration','loudness','valence','danceability','mode','time_signature_confidence','tempo_confidence'
,'key_confidence','mode_confidence']
audio_features = pd.DataFrame(audio_feature_list, columns=audio_features_headers)
audio_features.loc[:,].dropna(axis=0,how='all',inplace=True)
audio_features['genres'] = songs_dataset['genres']
rock_rap = audio_features.loc[(audio_features['genres'] == 'rock') | (audio_features['genres'] == 'rap')]
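# reset_index returns a new DataFrame, so the result has to be assigned back
rock_rap = rock_rap.reset_index(drop=True)
# use len(rock_rap) here: label_genres is not defined yet on this line
label_genres = np.array(rock_rap['genres']).reshape((len(rock_rap), 1))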
final_features = rock_rap.drop('genres',axis = 1).astype(float)
final_features['speechiness'].fillna(final_features['speechiness'].mean(),inplace=True)
knn = neighbors.KNeighborsClassifier(n_neighbors = 3)
standard_scaler = preprocessing.StandardScaler()
final_features = standard_scaler.fit_transform(final_features)
X_train, y_train, X_test, y_test = cross_validation.train_test_split(final_features,label_genres,test_size=0.2)
knn.fit(X_train,y_train)
ValueError: Found input variables with inconsistent numbers of samples: [7111, 1778]
Answer 0 (score: 1)
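Your problem is that you are assigning the results of train_test_split in the wrong order. The function returns the two pieces of the first array followed by the two pieces of the second, so the variable you named y_train actually holds the test features. As a result, knn.fit(X_train, y_train) tries to fit the model on the training features (7111 rows) and the test features (1778 rows), not on the training features and training labels as you intended. Use this instead: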
X_train, X_test, y_train, y_test = cross_validation.train_test_split(final_features,label_genres,test_size=0.2)
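To check the return order yourself, here is a minimal sketch on synthetic data. The zero-filled arrays are only stand-ins with the shapes from the question (I am assuming a 1-D label array here, and I import from sklearn.model_selection, which the question already imports), not the real song features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for final_features and label_genres (assumed shapes).
X = np.zeros((8889, 17))
y = np.zeros(8889, dtype=int)

# Return order: both pieces of X first, then both pieces of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Features and labels of the same split must have the same number of rows.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

With the unpacking order from the question, the second variable would instead receive the 1778-row block of test features, which is what fit then complains about.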
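Incidentally, the sample counts in the error message should have given you a hint: 7111 is almost exactly four times 1778 (0.8 / 0.2 = 4), which is what you would expect from the train and test parts of an 80/20 split.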
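The arithmetic behind that hint can be reproduced with a few lines. This is only a sketch: it assumes the test share is rounded up to a whole number of samples, which is consistent with the two numbers in the error message.

```python
import math

n_samples = 8889   # rows in X, from the question
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)             # 7111 1778
print(round(n_train / n_test, 2))  # 4.0
```

So whenever a "inconsistent numbers of samples" error shows two counts in roughly a 4:1 ratio after an 80/20 split, it is worth checking whether a train piece and a test piece were passed together.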