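I have built an ordinal regression model (this is my first time doing regression, so please be lenient) and now I need to evaluate it. What is the best way to do that? (I am using the mord API for ordinal regression.)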



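3) Build a regression model to predict the rating of each product based on attributes that correspond to some very common words used in the reviews (how many words to keep is left to your decision). So, for each product you will have a long(ish) vector of attributes based on how many times each word appears in the reviews of this product. Your target variable is the rating. You will be judged on the process of building the model (regularization, subset selection, validation set, etc.) and not so much on the accuracy of the results.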
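The attribute vector described in task 3 can be sketched as below. This is only a minimal illustration, assuming whitespace tokenization and a hypothetical `top_k` cutoff for the vocabulary size; the helper names `build_vocabulary` and `count_vector` and the sample reviews are made up for the example and are not part of the assignment or of my script.

```python
from collections import Counter


def build_vocabulary(reviews, top_k):
    """Return the top_k most frequent words across all reviews (hypothetical helper)."""
    totals = Counter()
    for text in reviews:
        totals.update(text.lower().split())
    return [word for word, _ in totals.most_common(top_k)]


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one product's review text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Made-up reviews, one string per product
reviews = [
    "great flavour great value",
    "bad flavour",
    "great gift",
]
vocabulary = build_vocabulary(reviews, top_k=2)
vectors = [count_vector(text, vocabulary) for text in reviews]
print(vocabulary)
print(vectors)
```

Each row of `vectors` is one product and each column is one vocabulary word, which is the same shape as the `bow` matrix built further down.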


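4) Using the vectors from question 3, perform dimensionality reduction (PCA or NMF). Can you conclude how many components you can keep? Experiment with this parameter and justify your final conclusion.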
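One common way to reason about how many components to keep in task 4 is to look at the cumulative explained variance. The sketch below assumes the per-component ratios are already available (for example from scikit-learn's `PCA.explained_variance_ratio_`); the helper name `components_for_variance` and the numbers in `ratios` are illustrative only and are not results from the review data.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold:
            return k
    # Threshold never reached: keep every component
    return len(explained_variance_ratio)


# Illustrative ratios only, sorted in decreasing order
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
for threshold in (0.7, 0.9, 1.0):
    print(threshold, components_for_variance(ratios, threshold))
```

Trying several thresholds like this and checking how the downstream score changes is one way to justify the final choice.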


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import textblob
import nltk
from pandas import ExcelWriter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from textblob import Word
from collections import Counter
import seaborn as sns
import mord as m
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

%matplotlib inline

df = # import dataframe from link

# Clean up Rating: while cleaning by hand I saw values outside the [1, 5] range, which need to be corrected. A histogram would also have revealed this, but since I spotted it while going through the data, plotting seems an unnecessary step.
df.loc[df.Rating > 5, 'Rating'] = np.nan
df.loc[df.Rating < 1, 'Rating'] = np.nan

# Convert weights to same measure (pounds). Most of the weights I inspected seem wrong...

for i in range(df.weight.size):
    cell = df.weight[i]
    # Numeric cells (including 0) need no parsing; only text such as '12 ounces' does
    if not isinstance(cell, (float, int)):
        number = ''.join([x for x in cell if (x.isdigit() or x == '.')])
        num = float(number)
        if re.search('ounces', cell):
            df.loc[i, 'weight'] = num * 0.0625    # Ounces to pounds conversion
        else:
            df.loc[i, 'weight'] = num            # Otherwise keep only the number (assumed to be pounds already)

df.loc[:, "Review"] = df["Title"] + str(' - ') + df["Text"]
df.drop('Title', axis=1, inplace=True)
df.drop('Text', axis=1, inplace=True)
df.columns = ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']
df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
df['Brand'] = df['Brand'].astype(str)
df['Review'] = df['Review'].astype(str)
df['Name'] = df['Name'].astype(str)

d = {'Brand':'first', 
df = df.groupby('Name').agg(d).reset_index()

df.Rating = df.Rating.round()
df.NumsHelpful = df.NumsHelpful.round()

df['Review2'] = df['Review'].apply(lambda x: " ".join(x.lower() for x in x.split()))

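# Strip punctuation; regex=True is required because newer pandas treats the pattern literally by default
df['Review2'] = df['Review2'].str.replace(r'[^\w\s]', '', regex=True)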

stop = stopwords.words('english')
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

freq = pd.Series(' '.join(df['Review2']).split()).value_counts()[:20]

common = ['wine', 'mix', 'taste', 'drink', 'one', 'price', 'product', 'flavour', 'would', 'bitters', 'bottle', 'buy','really', 'make']
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in common))

freq = pd.Series(' '.join(df['Review2']).split()).value_counts()[-10:]

freq = list(freq.index)
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))

df['words'] = df.Review2.str.strip().str.split(r'[\W_]+')

df['Review2'] = df['words'].apply(lambda x: " ".join([Word(word).lemmatize('v') for word in x]))

# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
bow = bow.join(rating)

# Keep only rows that have a rating, and only columns whose values sum to at least 80
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]

# Split into train and test sets
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop(columns='Rating')
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)

# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)

# Do PCA (dimensionality reduction)
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(x_train)
# Apply transform to both the training set and the test set.
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
# Make an instance of the model that keeps enough components to explain 95% of the variance
pca = PCA(.95)
pca.fit(x_train)
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)
regr.fit(x_train, y_train)
# Cross-validate on the PCA-transformed training data rather than on the raw bow matrix
scores = cross_val_score(regr, x_train, y_train, cv=10, scoring='accuracy')
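Because the ratings are ordered, exact-match accuracy treats predicting 4 for a true 5 the same as predicting 1 for a true 5. One option I am considering is to also report how far off the predictions are. The following is a minimal sketch with made-up labels (not output of the model above); the function names `rating_mae` and `within_one` are hypothetical helpers, not part of mord or scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true rating exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rating_mae(y_true, y_pred):
    """Average distance, in rating steps, between prediction and truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one(y_true, y_pred):
    """Fraction of predictions at most one rating step away from the truth."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ratings for illustration only
y_true = [5, 4, 4, 3, 1, 5, 2, 4]
y_pred = [5, 4, 3, 3, 3, 4, 2, 4]

print("accuracy:", accuracy(y_true, y_pred))      # 0.625
print("MAE:", rating_mae(y_true, y_pred))         # 0.5
print("within one:", within_one(y_true, y_pred))  # 0.875
```

With scikit-learn, the mean absolute error should also be obtainable inside cross-validation by passing `scoring='neg_mean_absolute_error'` to `cross_val_score`, which reports the negated error so that larger values are better.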






