I'm running into a problem with Logistic Regression. I'm currently doing tweet topic classification in Python. So far I'm able to read the training data from a MySQL table with pandas, clean the training tweets with NLTK, and create feature vectors with CountVectorizer. Here is the code:
import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
#connect to database and get the training data
engine = create_engine('mysql+mysqlconnector://root:root@localhost:3306/machinelearning')
tweet = pd.read_sql_query('SELECT label, tweets FROM tweetstable', engine, index_col='label')
#TEXT PREPROCESSING (REMOVE HTML MARKUP, REMOVE PUNCTUATION, TOKENIZING, REMOVE STOP WORDS, STEMMING)
def preprocessing(pptweets):
    pptweets = pptweets.lower()
    urlrtweets = re.sub(r'https:.*$', ":", pptweets)
    rpptweets = urlrtweets.replace("_", " ")
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rpptweets)
    filteredwords = [w for w in tokens if w not in stopwords.words('english')]
    stemmer = SnowballStemmer("english")
    stweets = [stemmer.stem(w) for w in filteredwords]
    return " ".join(stweets)
#initialize an empty list to hold the clean tweets
cleantweets = []
#loop over each tweet, creating an index i that goes from 0 to the length of the tweets list
for i in range(0, len(tweet["tweets"])):
    cleantweets.append(preprocessing(tweet["tweets"][i]))
#initialize the "CountVectorizer" object, which is scikit-learn's BoW tools
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
#fit_transform() does two things: first, it fits the model
#and learns the vocabulary; second, it transforms our training data
#into feature vectors. The input to fit_transform should be a list of strings
traindatafeatures = vectorizer.fit_transform(cleantweets)
#Numpy arrays are easy to work with, so convert the result to an array
traindatafeatures = traindatafeatures.toarray()
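Just to check my own understanding (this is a separate toy snippet, not my real data), this is how I think CountVectorizer is supposed to behave: fit_transform takes a list of strings and returns one row of word counts per document.

from sklearn.feature_extraction.text import CountVectorizer

toydocs = ["great match today", "new phone has a great camera", "the match ended today"]
toyvectorizer = CountVectorizer(analyzer="word", max_features=5000)
toyfeatures = toyvectorizer.fit_transform(toydocs).toarray()
print(toyvectorizer.vocabulary_)   # word -> column index mapping learned from toydocs
print(toyfeatures.shape)           # (3 documents, vocabulary size)
print(toyfeatures)                 # count of each vocabulary word in each document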
The problem I'm facing now is that I don't know how to make Logistic Regression learn from the training data. Here is the code I use to fit the training data into the Logistic Regression classifier:
#train the model
logmodel = LogisticRegression()
logmodel.fit(traindatafeatures, tweet["label"])
#check trained model intercept
print(logmodel.intercept_)
#check trained model coefficients
print(logmodel.coef_)
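As far as I understand, fit() expects X with one row per sample and one column per feature, and y with one label per sample. This toy sketch (made-up data, nothing to do with my real tweets) is roughly what I'm trying to reproduce with traindatafeatures and tweet["label"]:

from sklearn.linear_model import LogisticRegression

toyX = [[1, 0, 2], [0, 1, 0], [3, 0, 1], [0, 2, 0]]   # 4 samples, 3 features
toyY = ["sports", "politics", "sports", "politics"]   # one label per sample
toymodel = LogisticRegression()
toymodel.fit(toyX, toyY)
print(toymodel.intercept_)
print(toymodel.coef_)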
I pass traindatafeatures as the input X and tweet["label"] as the label/class Y of each tweet to the Logistic Regression classifier so that it can learn from them, but when I run the full code I get the following error:
Traceback (most recent call last):
File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1945, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'label'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/Indra/PycharmProjects/TextClassifier/textclassifier.py", line 52, in <module>
logmodel.fit(traindatafeatures, tweet["label"])
File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'label'
Can anyone help me with this? :( I've been searching for tutorials, but so far I haven't found anything.
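One thing I'm not sure about (maybe it is the actual cause?): I read the table with index_col='label', so perhaps "label" ends up as the DataFrame index instead of a normal column, which would explain the KeyError when I do tweet["label"]. This tiny reproduction with a made-up DataFrame shows the behaviour I mean:

import pandas as pd

df = pd.DataFrame({"label": ["sports", "politics"], "tweets": ["tweet a", "tweet b"]})
df = df.set_index("label")       # similar effect to index_col='label' in read_sql_query
print("label" in df.columns)     # False - "label" is no longer a column
print(df.index.values)           # the labels are still available here, as the index
#print(df["label"])              # this line raises KeyError: 'label'

Is the right fix simply to drop index_col='label' (or to pass tweet.index as Y), or am I misunderstanding something else?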