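So, I have been trying to classify how popular a song is based on its lyrics and other parameters (tempo and so on). This is the code segment I am trying to run through tkinter.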
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer,CountVectorizer
df = pd.read_csv(r'Dataset(Advanced)(processed lyrics).csv')
df['Lyrics'] = df['Lyrics'].astype(str)
mapper = DataFrameMapper([
    ('Lyrics', CountVectorizer()),
    ('Tempo', None),
    ('Energy', None),
    ('Loudness', None),
    ('Danceability', None),
    ('Speechiness', None),
    ('Acousticness', None),
    ('Artist Hit', None)
])
features = mapper.fit_transform(df[['Lyrics', 'Tempo', 'Energy', 'Loudness', 'Danceability',
                                    'Speechiness', 'Acousticness', 'Artist Hit']])
y = df['Hit']
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(features, y)
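Now, this is the function that is called when I click the button. Here I convert all the values of one song (lyrics, tempo and so on) into DataFrame columns so that they fit the DataFrameMapper. Although all of this looks fine, the prediction step fails (output and error below).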
def predict():
    user_Lyrics = lyricsTextBox2.get(1.0, "end-1c")
    user_Lyrics = user_Lyrics.values.astype(str)
    print(user_Lyrics.head())
    print(type(user_Lyrics))
    # Everything in lowercase
    user_Lyrics = user_Lyrics.apply(lambda x: " ".join(x.lower() for x in str(x).split()))
    # Removing punctuation that does not add meaning to the song
    user_Lyrics = user_Lyrics.str.replace('[^\w\s]', '')
    # Removing of stop words
    from nltk.corpus import stopwords
    stop = stopwords.words('english')
    user_Lyrics = user_Lyrics.apply(lambda x: " ".join(x for x in str(x).split() if x not in stop))
    # Correction of Spelling mistakes
    from textblob import TextBlob
    user_Lyrics = user_Lyrics.apply(lambda x: str(TextBlob(x).correct()))
    # Lemmatization is basically converting a word into its root word. It is preferred over Stemming.
    from textblob import Word
    user_Lyrics = user_Lyrics.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
    df['AP'] = float(ArtistPopularityEntry.get())
    df['SE'] = float(EnergyEntry.get())
    df['SL'] = float(LoudnessEntry.get())
    df['SA'] = float(AcousticnessEntry.get())
    df['ST'] = float(TempoEntry.get())
    df['SD'] = float(DanceabilityEntry.get())
    df['SS'] = float(SpeechinessEntry.get())
    mapper2 = DataFrameMapper([
        ('Lyrics_User', CountVectorizer()),
        ('ST', None),
        ('SE', None),
        ('SL', None),
        ('SD', None),
        ('SS', None),
        ('SA', None),
        ('AP', None)
    ])
    features2 = mapper2.fit_transform(df[['Lyrics_User', 'ST', 'SE', 'SL', 'SD', 'SS', 'SA', 'AP']])
    print(type(features2))
    print(len(features2))
    print(features2.shape)
    print(type(features))
    print(len(features))
    print(features.shape)
    user_prediction = model.predict(features2)
    print(user_prediction)
    if (user_prediction[0] == 1):
        resultLabel2.config(text='Song is Hit')
    else:
        resultLabel2.config(text='Song is not Hit')
Output:
<class 'numpy.ndarray'>
831
(831, 18)
<class 'numpy.ndarray'>
831
(831, 1629)
Error:
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Users\moksh\Anaconda3\lib\tkinter\__init__.py", line 1702, in __call__
    return self.func(*args)
  File "<ipython-input-4-f6ddab248363>", line 69, in predict
    user_prediction = model.predict(features2)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 66, in predict
    jll = self._joint_log_likelihood(X)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 725, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 140, in safe_sparse_dot
    return np.dot(a, b)
ValueError: shapes (831,18) and (1629,2) not aligned: 18 (dim 1) != 1629 (dim 0)
Edit
df['AP'] = float(ArtistPopularityEntry.get())
df['SE'] = float(EnergyEntry.get())
df['ST'] = float(TempoEntry.get())
features2 = mapper.transform(df[['Lyrics_User', 'ST', 'SE', 'AP']])
This raises a different error:
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3063, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Lyrics'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\moksh\Anaconda3\lib\tkinter\__init__.py", line 1702, in __call__
    return self.func(*args)
  File "", line 53, in predict
    features2 = mapper.transform(df[['Lyrics_User', 'ST', 'SE', 'AP']])
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn_pandas\dataframe_mapper.py", line 289, in transform
    Xt = self._get_col_subset(X, columns, input_df)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\sklearn_pandas\dataframe_mapper.py", line 182, in _get_col_subset
    t = X[cols[0]]
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2685, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2692, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2486, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "C:\Users\moksh\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3065, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Lyrics'
Answer 0 (score: 0)
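You are fitting two different CountVectorizer objects (one for training and another for prediction), and they learn two different vocabularies.
During training, the data is large and contains many samples, so the mapper produces 1629 feature columns (the vocabulary learned from all the lyrics plus the seven numeric columns). During prediction, the new CountVectorizer is fitted only on the single song's lyrics, so the result has just 18 columns.
That is the source of the error.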
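The effect can be reproduced without any library. The following is a minimal sketch, assuming a toy bag-of-words implementation in place of CountVectorizer; fit_vocabulary and transform are hypothetical helpers written for illustration, and the lyrics are made up.

```python
def fit_vocabulary(documents):
    """Learn a word -> column index mapping from the given documents."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def transform(documents, vocabulary):
    """Count occurrences of each known word; words outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.lower().split():
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows


# Made-up data standing in for the training lyrics and the user's song.
training_lyrics = ["love you baby", "rain on empty streets", "dance all night"]
user_lyrics = ["love you tonight"]

train_vocab = fit_vocabulary(training_lyrics)  # learned once, at training time
refit_vocab = fit_vocabulary(user_lyrics)      # what a second fit_transform() amounts to

wrong = transform(user_lyrics, refit_vocab)    # width follows the user's own words
right = transform(user_lyrics, train_vocab)    # width follows the training vocabulary

print(len(train_vocab), len(wrong[0]), len(right[0]))
```

Only the row built with the training vocabulary has the width the model was trained on, which is why refitting at prediction time leads to the shape mismatch.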
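Now ask yourself: why do you use the same model object at prediction time rather than a new one? Because a new model would not have learned anything. In the same way, the CountVectorizer inside the original mapper object has already learned something about the data (its vocabulary), and that has to be reused at prediction time.
So instead of creating a new mapper2 object and calling fit_transform() on it (which learns from scratch from whatever data is passed in), you need to take the old mapper (which is already fitted) and call transform() on it.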
Instead of:
mapper2 = DataFrameMapper([
('Lyrics_User', CountVectorizer()),
('ST', None),
('SE', None),
('SL', None),
('SD', None),
('SS', None),
('SA', None),
('AP', None)
])
features2 = mapper2.fit_transform(df[['Lyrics_User', 'ST', 'SE', 'SL', 'SD', 'SS', 'SA', 'AP']])
do this:
features2 = mapper.transform(df[['Lyrics', 'ST', 'SE', 'SL', 'SD', 'SS', 'SA', 'AP']])