ValueError: shapes (831,18) and (1629,2) not aligned: 18 (dim 1) != 1629 (dim 0)

Time: 2018-10-28 20:11:51

Tags: python dataframe machine-learning scikit-learn

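I have been trying to classify how popular a song is based on its lyrics and other parameters (tempo and so on). This is the code I am trying to run through tkinter.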

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer,CountVectorizer

df = pd.read_csv(r'Dataset(Advanced)(processed lyrics).csv') 

df['Lyrics'] = df['Lyrics'].astype(str)   

mapper = DataFrameMapper([
    ('Lyrics', CountVectorizer()),
    ('Tempo', None),
    ('Energy', None),
    ('Loudness', None),
    ('Danceability', None),
    ('Speechiness', None),
    ('Acousticness', None),
    ('Artist Hit', None)
])

features = mapper.fit_transform(df[['Lyrics', 'Tempo', 'Energy', 'Loudness', 'Danceability',
                                    'Speechiness', 'Acousticness', 'Artist Hit']])
y = df['Hit']

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

model.fit(features, y)

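This is the function I call when the button is clicked. In it I put all the values for one song (lyrics, tempo and so on) into DataFrame columns so that they fit the DataFrameMapper. All of this looks fine to me, but it fails with the error shown further down.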

def predict():
    user_Lyrics = lyricsTextBox2.get(1.0, "end-1c")
    user_Lyrics = user_Lyrics.values.astype(str)
    print(user_Lyrics.head())
    print(type(user_Lyrics))

    # Everything in lowercase
    user_Lyrics = user_Lyrics.apply(lambda x: " ".join(x.lower() for x in str(x).split()))

    # Removing punctuation that does not add meaning to the song
    user_Lyrics = user_Lyrics.str.replace('[^\w\s]', '')

    # Removing of stop words
    from nltk.corpus import stopwords

    stop = stopwords.words('english')
    user_Lyrics = user_Lyrics.apply(lambda x: " ".join(x for x in str(x).split() if x not in stop))

    # Correction of Spelling mistakes
    from textblob import TextBlob
    user_Lyrics = user_Lyrics.apply(lambda x: str(TextBlob(x).correct()))

    # Lemmatization is basically converting a word into its root word. It is preferred over Stemming.
    from textblob import Word
    user_Lyrics = user_Lyrics.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

    df['AP'] = float(ArtistPopularityEntry.get())
    df['SE'] = float(EnergyEntry.get())
    df['SL'] = float(LoudnessEntry.get())
    df['SA'] = float(AcousticnessEntry.get())
    df['ST'] = float(TempoEntry.get())
    df['SD'] = float(DanceabilityEntry.get())
    df['SS'] = float(SpeechinessEntry.get())

    mapper2 = DataFrameMapper([
        ('Lyrics_User', CountVectorizer()),
        ('ST', None),
        ('SE', None),
        ('SL', None),
        ('SD', None),
        ('SS', None),
        ('SA', None),
        ('AP', None)
    ])
    features2 = mapper2.fit_transform(df[['Lyrics_User', 'ST', 'SE', 'SL', 'SD', 'SS', 'SA', 'AP']])

    print(type(features2))
    print(len(features2))
    print(features2.shape)

    print(type(features))
    print(len(features))
    print(features.shape)

    user_prediction = model.predict(features2)
    print(user_prediction)
    if (user_prediction[0] == 1):
        resultLabel2.config(text='Song is Hit')
    else:
        resultLabel2.config(text='Song is not Hit')

Output:

<class 'numpy.ndarray'>
831
(831, 18)
<class 'numpy.ndarray'>
831
(831, 1629)

Error: 

Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Users\moksh\Anaconda3\lib\tkinter\__init__.py", line 1702, in __call__
    return self.func(*args)
  File "<ipython-input-4-f6ddab248363>", line 69, in predict
    user_prediction = model.predict(features2)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 66, in predict
    jll = self._joint_log_likelihood(X)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 725, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 140, in safe_sparse_dot
    return np.dot(a, b)
ValueError: shapes (831,18) and (1629,2) not aligned: 18 (dim 1) != 1629 (dim 0)

Edit

 df['AP'] = float(ArtistPopularityEntry.get())
 df['SE'] = float(EnergyEntry.get())
 df['ST'] = float(TempoEntry.get())


 features2 = mapper.transform(df[['Lyrics_User', 'ST', 'SE', 'AP']])

This produces a different error:

  

Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3063, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Lyrics'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\moksh\Anaconda3\lib\tkinter\__init__.py", line 1702, in __call__
    return self.func(*args)
  File "", line 53, in predict
    features2 = mapper.transform(df[['Lyrics_User', 'ST', 'SE', 'AP']])
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn_pandas\dataframe_mapper.py", line 289, in transform
    Xt = self._get_col_subset(X, columns, input_df)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn_pandas\dataframe_mapper.py", line 182, in _get_col_subset
    t = X[cols[0]]
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2685, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2692, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2486, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3065, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Lyrics'

1 Answer:

Answer 0 (score: 0)

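You are fitting two different CountVectorizer objects (one for training, one for prediction), and each one learns its own vocabulary.

During training the data is large and contains many samples, so the resulting feature matrix has 1629 columns (the vocabulary learned from the lyrics plus the seven numeric columns). At prediction time the new vectorizer is fitted on the lyrics of a single song only, so its vocabulary is tiny and the resulting matrix has just 18 columns.

That is the source of the error.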
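The mismatch can be reproduced without the rest of the pipeline. The following is a minimal sketch using made-up lyrics (the strings and variable names are illustrative and not taken from the question's dataset): two independently fitted CountVectorizer objects produce matrices with different numbers of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the training lyrics and the user's input.
training_lyrics = [
    "love me tonight baby",
    "rain falls on empty streets",
    "dance all night baby",
]
user_lyrics = ["dance with me baby"]

# Vectorizer fitted on the training data: one column per distinct training word.
train_vectorizer = CountVectorizer()
train_counts = train_vectorizer.fit_transform(training_lyrics)

# A second vectorizer fitted only on the user's text learns a much smaller vocabulary.
user_vectorizer = CountVectorizer()
user_counts = user_vectorizer.fit_transform(user_lyrics)

print(train_counts.shape)  # (3, 12)
print(user_counts.shape)   # (1, 4)
```

A model trained on the first matrix expects 12 columns, so handing it the 4-column matrix fails in the same way as the 1629 versus 18 case above.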

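Now ask yourself: why do you use the same model object at prediction time instead of a new one? Because a new model would not have learned anything. In the same way, the CountVectorizer inside the original mapper object has already learned something about the data (its vocabulary), and that has to be reused at prediction time.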
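To make the link to the traceback concrete: a fitted MultinomialNB stores one log-probability per class and per training column in feature_log_prob_, and predict() multiplies the input by the transpose of that array. The sketch below uses random count data as an assumption (it is not the question's data) to show that the stored shape pins down how many columns the input must have.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Illustrative training set: 20 samples, 6 count features, 2 classes.
X_train = rng.integers(0, 5, size=(20, 6))
y_train = np.array([0, 1] * 10)

model = MultinomialNB().fit(X_train, y_train)

# One row per class, one column per training feature.
print(model.feature_log_prob_.shape)  # (2, 6)

# An input with a different number of columns cannot be multiplied with it.
X_bad = rng.integers(0, 5, size=(1, 3))
try:
    model.predict(X_bad)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

The exact wording of the ValueError depends on the scikit-learn version; newer releases report the feature-count mismatch directly instead of surfacing the np.dot message seen in the question.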

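Instead of creating a new mapper2 object and calling fit_transform() on it (which learns from scratch from whatever data is passed in), you need to use the old mapper (which is already fitted) and call transform() on it.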

Instead of:

mapper2 = DataFrameMapper([
    ('Lyrics_User', CountVectorizer()),
    ('ST', None),
    ('SE', None),
    ('SL', None),
    ('SD', None),
    ('SS', None),
    ('SA', None),
    ('AP', None)
])
features2 = mapper2.fit_transform(df[['Lyrics_User', 'ST', 'SE', 'SL', 'SD', 'SS', 'SA', 'AP']])

Do this:

features2 = mapper.transform(df[['Lyrics', 'ST', 'SE', 'SL', 'SD', 'SS', 'SA', 'AP']])
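One caveat, as far as I can tell: a fitted mapper looks columns up by the names it was fitted with, so the frame passed to transform() needs the training column names ('Lyrics', 'Tempo', 'Energy', ...) rather than 'ST', 'SE', and so on. The KeyError: 'Lyrics' in the question's edit appears to be exactly that lookup failing. Below is a self-contained sketch of the fit-once, transform-at-prediction pattern. It swaps in scikit-learn's ColumnTransformer for DataFrameMapper so that it runs without sklearn_pandas, and the data, the reduced column set, and the predict_song helper are all made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data; the column names mirror the question's.
train = pd.DataFrame({
    "Lyrics": ["love me tonight baby", "rain falls on empty streets",
               "dance all night baby", "cold wind and empty rooms"],
    "Tempo": [120.0, 80.0, 128.0, 70.0],
    "Energy": [0.9, 0.3, 0.8, 0.2],
    "Hit": [1, 0, 1, 0],
})
feature_cols = ["Lyrics", "Tempo", "Energy"]

# Stand-in for DataFrameMapper: count the lyrics, pass the numbers through.
transformer = ColumnTransformer([
    ("lyrics", CountVectorizer(), "Lyrics"),
    ("numeric", "passthrough", ["Tempo", "Energy"]),
])

# Fit the transformer and the model once, on the training data only.
features = transformer.fit_transform(train[feature_cols])
model = MultinomialNB().fit(features, train["Hit"])


def predict_song(lyrics, tempo, energy):
    """Hypothetical helper: build a one-row frame with the training column
    names and reuse the already fitted transformer via transform()."""
    row = pd.DataFrame([{"Lyrics": lyrics, "Tempo": tempo, "Energy": energy}])
    row_features = transformer.transform(row[feature_cols])
    return row_features, model.predict(row_features)


row_features, prediction = predict_song("dance with me baby tonight", 125.0, 0.85)
print(features.shape, row_features.shape)  # same number of columns
print(prediction)
```

Words in the user's lyrics that never occurred in training (here "with") are simply ignored by transform(), which is why the column count stays the same. Building a separate one-row frame also avoids writing the user's values into every row of the training DataFrame, which is why the question's features2 had 831 rows.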