Predicting a sentiment score as an integer/double value with a linear SVM

Time: 2019-04-13 10:37:50

Tags: python artificial-intelligence svm

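I use a linear SVM (LinearSVC) to predict the sentiment of tweets. The SVM classifies each tweet as negative or positive. I use a pipeline to clean, vectorize, and classify the tweets, in that order. But when I predict the sentiment, I only ever get 0 (negative) or 4 (positive). I would like a prediction score as a decimal between -1 and 1, to get a better idea of how positive or negative a tweet is: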

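Code:

import re
import string

import pandas as pd
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# assumed: `tok` is never defined in the post; the comment in tweet_cleaner names this tokenizer
tok = WordPunctTokenizer()

# read in influential Twitter users on the stock market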
twitter_users = pd.read_csv('core/infl_users.csv', encoding = "ISO-8859-1")
twitter_users.columns = ['users']

# MODEL TRAINING

# read the training set for the model: CSV to DataFrame
df = pd.read_csv("../trainingset.csv", encoding='latin-1')

# label the training set's columns
df.columns = ["target", "id", "data", "query", "user", "text"]

# remove the columns that are not needed (keyword form; the positional axis argument is gone in recent pandas)
df = df.drop(columns=["id", "data", "query", "user"])


pat1 = r'@[A-Za-z0-9_]+'        # matches @ mentions in tweets
pat2 = r'https?://[^ ]+'        # matches http(s) URLs in tweets
combined_pat = r'|'.join((pat1, pat2)) # alternation of pat1 and pat2
www_pat = r'www\.[^ ]+'         # matches URLs that start with "www." (dot escaped so it is literal)
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",   # converting words like isn't to is not
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner(text):  # clean a single tweet
    soup = BeautifulSoup(text, 'lxml')    # parse the tweet as HTML
    souped = soup.get_text()   # keep only the text content
    try:
        # Python 2 byte strings only; a Python 3 str has no decode() and falls through to the except branch
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")    # strip the UTF-8 BOM
    except (AttributeError, UnicodeError):
        bom_removed = souped
    stripped = re.sub(combined_pat, '', bom_removed) # remove @ mentions and http(s) URLs
    stripped = re.sub(www_pat, '', stripped) # remove www URLs
    lower_case = stripped.lower()      # convert to lower case
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case) # expand contractions such as isn't to is not
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)       # replace every non-letter character with a space
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1] # WordPunct-tokenize and keep only words longer than one character
    return (" ".join(words)).strip() # join the words back into one string


# Build a list of stopwords to use to filter
stopwords = list(STOP_WORDS)
# Use the punctuations of string module
punctuations = string.punctuation
# Creating a Spacy Parser
parser = English()

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    return text.strip().lower()
def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]

    #mytokens = [word.lemma_.lower().strip() for word in mytokens]
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    #mytokens = preprocess2(mytokens)
    return mytokens

# Vectorization
# Convert a collection of text documents to a matrix of token counts
# ngrams : extension of the unigram model by taking n words together
# big advantage: it preserves context. -> words that appear together in the text will also appear together in a n-gram
# n-grams can increase the accuracy in classifying pos & neg
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

# Linear Support Vector Classification.
# "Similar" to SVC with parameter kernel=’linear’
# more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
# LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers), size [n_samples]:
classifier = LinearSVC(C=0.5)


# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)

#put tweet-text in X and target in ylabels to train model
X = df['text']
ylabels = df['target']

# The next step is to split the data into training and test sets.
# 80% of the dataset is used here; that 80% is then split 80/20 again:
# 80% to train the model and 20% to test the results.
# The remaining 20% (X_kast, y_kast) is kept aside to train the final model.
X_tr, X_kast, y_tr, y_kast = train_test_split(X, ylabels, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)

# Create the  pipeline to clean, tokenize, vectorize, and classify
# Tying together different pieces of the ML process is known as a pipeline.
# Each stage of a pipeline is fed data processed from its preceding stage
# Pipelines only transform the observed data (X).
# Pipeline can be used to chain multiple estimators into one.
# The pipeline is given as a list of (key, value) pairs.
# The key is a string naming a particular step;
# the value is the transformer or estimator object for that step.

# fit() fits the transforms one after the other, transforms the data, and then fits the final estimator on the transformed data.

pipe_tfid = Pipeline([("cleaner", predictors()),
                 ('vectorizer', tfvectorizer),
                 ('classifier', classifier)])

# Fit our data, fit = training the model
pipe_tfid.fit(X_train,y_train)
# Predicting with a test dataset
#sample_prediction1 = pipe_tfid.predict(X_test)
accur = pipe_tfid.score(X_test,y_test)

When I predict the sentiment score:

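# predict expects an iterable of documents; a bare string would be iterated character by character
pipe_tfid.predict(['textoftweet'])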

1 Answer:

Answer 0 (score: 0)

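During training, an SVM computes a weight vector w that maximizes the margin between the classes. It then predicts (for a binary classifier) with the rule

if w^T x + bias > 0, choose C1, otherwise choose C2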
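
To make the rule concrete, here is a minimal plain-Python sketch. The weights, bias, and feature vector are made-up values, the helper names are hypothetical, and the labels 4 and 0 are borrowed from the question. The `tanh` squashing at the end is only one illustrative way to force the unbounded score into (-1, 1); it is not a probability and not something the SVM itself provides.

```python
import math

def decision_value(w, x, bias):
    """Signed score w^T x + bias for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + bias

def predict_class(w, x, bias, c1=4, c2=0):
    """Choose C1 when the score is positive, otherwise C2."""
    return c1 if decision_value(w, x, bias) > 0 else c2

def squash(score):
    """Map an unbounded score into (-1, 1); illustrative only, not a probability."""
    return math.tanh(score)

w = [0.8, -1.2, 0.5]   # hypothetical learned weights
bias = -0.1            # hypothetical learned bias
x = [1.0, 0.0, 1.0]    # hypothetical feature vector for one tweet

score = decision_value(w, x, bias)   # 0.8 + 0.5 - 0.1
label = predict_class(w, x, bias)    # positive score, so C1
bounded = squash(score)              # same sign as score, strictly inside (-1, 1)
```

The class label only uses the sign of the score; its magnitude says how far the point lies from the separating hyperplane, which is the information the question is after.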

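An SVM cannot return probabilities, because it is not a probabilistic model. There are probabilistic interpretations of SVMs, such as this. However, if you want to know how confident a prediction is, you are better off using a standard probabilistic model (for example NaiveBayes, LogisticRegression, etc.).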
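
As a rough sketch of both options in scikit-learn: `LinearSVC` exposes the signed score through `decision_function` (unbounded, so not confined to [-1, 1]), while `LogisticRegression` offers `predict_proba`, which can be rescaled to [-1, 1]. The tiny corpus below is invented for illustration, the default tokenizer stands in for the question's spaCy tokenizer, and the `2 * p - 1` rescaling is an assumption about what kind of score is wanted, not part of either model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus; 0 = negative, 4 = positive as in the question.
texts = ["great stock rally today", "love this strong earnings report",
         "awful losses and bad news", "terrible crash hate this market",
         "good gains great outlook", "bad earnings awful guidance"]
labels = [4, 4, 0, 0, 4, 0]
new_tweets = ["great gains", "awful losses"]

# Option 1: keep the SVM and read the signed score instead of the label.
svm_pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("classifier", LinearSVC(C=0.5))])
svm_pipe.fit(texts, labels)
# One value per tweet; positive values lean towards class 4, negative towards 0.
margins = svm_pipe.decision_function(new_tweets)

# Option 2: switch to a probabilistic model and rescale its probability.
logreg_pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                        ("classifier", LogisticRegression())])
logreg_pipe.fit(texts, labels)
# Column order follows classes_, so look up the positive class explicitly.
pos_col = list(logreg_pipe.classes_).index(4)
p_pos = logreg_pipe.predict_proba(new_tweets)[:, pos_col]
scores = 2 * p_pos - 1   # rescale a probability in [0, 1] to a score in [-1, 1]
```

With the question's pipeline, the equivalent calls would be `pipe_tfid.decision_function([...])` for the existing LinearSVC, or swapping the `'classifier'` step for `LogisticRegression()` to get `predict_proba`.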