Tweet topic classification using logistic regression

Time: 2016-05-28 16:37:18

Tags: python mysql pandas logistic-regression text-classification

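I am having trouble with logistic regression. I am currently working on tweet topic classification in Python. So far I can read the training data from a MySQL table with pandas, clean the training tweets with NLTK, and build feature vectors with CountVectorizer. Here is the code: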

import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

#connect to database and get the training data
engine = create_engine('mysql+mysqlconnector://root:root@localhost:3306/machinelearning')
tweet = pd.read_sql_query('SELECT label, tweets FROM tweetstable', engine, index_col='label')

#TEXT PREPROCESSING (LOWERCASE, STRIP URLS, TOKENIZE, REMOVE STOP WORDS, STEM)

def preprocessing(pptweets):
    pptweets = pptweets.lower()
    urlrtweets = re.sub(r'https:.*$', ":", pptweets)
    rpptweets = urlrtweets.replace("_", " ")
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rpptweets)
    #build the stop word set once instead of reloading the list for every token
    stopset = set(stopwords.words('english'))
    filteredwords = [w for w in tokens if w not in stopset]
    stemmer = SnowballStemmer("english")
    stweets = [stemmer.stem(word) for word in filteredwords]
    return " ".join(stweets)

#initialize an empty list to hold the clean tweets
cleantweets = []

#loop over each tweet and append its cleaned text to the list
for text in tweet["tweets"]:
    cleantweets.append(preprocessing(text))

#initialize the "CountVectorizer" object, which is scikit-learn's bag-of-words tool
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

#fit_transform() does two functions: First, it fits the model
#and learns the vocabulary; second, it transforms our training data
#into feature vectors. the input to fit_transform should be a list of strings
traindatafeatures = vectorizer.fit_transform(cleantweets)

#NumPy arrays are easy to work with, so convert the result to an array
traindatafeatures = traindatafeatures.toarray()
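
One detail in the preprocessing step above is worth checking: the pattern r'https:.*$' matches from the first "https:" to the end of the string, so any words that follow a link are discarded, and plain http:// links are left in place. The sketch below uses only the standard library to compare it with a narrower pattern. The sample tweet, the helper names, and the alternative pattern are assumptions made for illustration, not part of the original script.

```python
import re

# Hypothetical sample tweet, for illustration only.
sample = "great match today https://t.co/abc123 what a goal"


def strip_urls_original(text):
    # Pattern from the question: replaces everything from "https:" to the
    # end of the string with a single colon.
    return re.sub(r'https:.*$', ":", text)


def strip_urls_narrow(text):
    # Assumed alternative: remove only the URL token itself (http or https).
    return re.sub(r'https?://\S+', " ", text)


print(strip_urls_original(sample))        # words after the link are lost
print(strip_urls_narrow(sample).split())  # words after the link survive
```

Whether the trailing words matter depends on the data; if links always come last in the tweets, the two patterns give similar results.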

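The problem I am facing now is that I don't know how to use logistic regression to learn from the training data. Here is the code I use to fit the training data to the logistic regression classifier: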

#train the model
logmodel = LogisticRegression()
logmodel.fit(traindatafeatures, tweet["label"])

#check trained model intercept
print(logmodel.intercept_)

#check trained model coefficients
print(logmodel.coef_)
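
For reference, here is a minimal sketch of the fit-and-predict flow that the snippet above is aiming for, run on made-up data. The texts, labels, and variable names are illustrative assumptions, not the asker's data. It passes the sparse matrix returned by CountVectorizer straight to fit(), which scikit-learn's LogisticRegression accepts, so the toarray() conversion is optional.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, already-cleaned training tweets and their topic labels.
train_texts = [
    "goal match team",
    "match team win",
    "elect vote parti",
    "vote parti poll",
]
train_labels = ["sports", "sports", "politics", "politics"]

# Learn the vocabulary and build the bag-of-words matrix (SciPy sparse).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# X must have one row per tweet; y must have one label per row of X.
model = LogisticRegression()
model.fit(X_train, train_labels)

# New tweets must go through the SAME fitted vectorizer (transform, not fit).
X_new = vectorizer.transform(["team win match", "parti poll"])
predictions = model.predict(X_new)

print(model.classes_)     # class labels in sorted order
print(model.coef_.shape)  # (1, vocabulary size) for a two-class problem
print(predictions)
```

The key requirement is that the label sequence passed as y has the same length and order as the rows of X.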

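I pass traindatafeatures as the input X and tweet["label"] as the label/class Y of each tweet to the logistic regression classifier so that it can learn from them, but when I run the full code I get the following error: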

Traceback (most recent call last):
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1945, in get_loc
    return self._engine.get_loc(key)
  File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
  File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
  File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
  File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'label'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Indra/PycharmProjects/TextClassifier/textclassifier.py", line 52, in <module>
    logmodel.fit(traindatafeatures, tweet["label"])
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3290, in get
    loc = self.items.get_loc(item)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
  File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
  File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
  File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'label'
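
The final KeyError: 'label' is raised by the tweet["label"] lookup, which is consistent with the index_col='label' argument in the read_sql_query call: that argument makes label the DataFrame index, so it is no longer available as a column. Below is a small sketch of the behaviour and two possible ways around it. It uses an in-memory SQLite table as a stand-in for the MySQL table, and the rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweetstable (label TEXT, tweets TEXT)")
conn.executemany(
    "INSERT INTO tweetstable VALUES (?, ?)",
    [("sports", "what a goal"), ("politics", "election day")],
)
conn.commit()

query = "SELECT label, tweets FROM tweetstable"

# With index_col="label", 'label' becomes the index and is not a column,
# so indexed["label"] would raise KeyError: 'label'.
indexed = pd.read_sql_query(query, conn, index_col="label")
print(list(indexed.columns))
print("label" in indexed.columns)

# Option 1: keep index_col and read the labels from the index.
labels_from_index = indexed.index

# Option 2: leave out index_col so 'label' stays a regular column.
plain = pd.read_sql_query(query, conn)
labels_from_column = plain["label"]

conn.close()
```

With either option, the labels can then be passed as the second argument to logmodel.fit(), for example logmodel.fit(traindatafeatures, tweet.index) if index_col='label' is kept.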

Can anyone help me solve this? :( I have been looking for tutorials, but so far I haven't found anything.

0 answers:

No answers