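I am trying to train a linear regression model with sklearn to predict the number of likes for a given tweet. I have the following features/attributes.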
['id', 'month', 'hour', 'text', 'hasMedia', 'hasHashtag', 'followers_count', 'retweet_count', 'favourite_count', 'sentiment', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust', ......keywords............]
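I use TfidfVectorizer to extract keywords. The problem is that the number of keywords depends on the size of the training data, so the number of independent attributes varies as well. As a result, the attributes of the training data and the test data do not match, and I get ValueError: Shape of passed values is (1, 1678), indices imply (1, 1928).
When I split the same data into training and test sets and predict as shown below, it works fine.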
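To illustrate the mismatch outside my real pipeline, here is a minimal sketch. The two toy corpora are made up and stand in for a large batch and a single new tweet, and the default tokenizer is used instead of my identity_tokenizer. Fitting a fresh vectorizer on each batch yields a different number of keyword columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy data: a "large" training batch and a single new tweet.
large_batch = ["good morning world", "rainy day today", "world cup final today"]
small_batch = ["good day"]

# Each call to fit_transform learns its own vocabulary from the data it sees,
# so the number of keyword columns follows the data, not the trained model.
n_large = TfidfVectorizer().fit_transform(large_batch).shape[1]
n_small = TfidfVectorizer().fit_transform(small_batch).shape[1]

print("keyword columns from the large batch:", n_large)
print("keyword columns from the single tweet:", n_small)
```

The column counts differ because the vocabulary is rebuilt on every fit, which appears to be the same effect behind the shape error above.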
Training and prediction code
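import os
import joblib
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Note: dirname is defined elsewhere in my script.
def train_favourite_prediction(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    # now you can save it to a file
    joblib.dump(regressor, os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    return None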
def predict_favourites(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # and later you can load it (no need to create a new LinearRegression first)
    regressor = joblib.load(os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    # This fails when X has a different number of columns than the training data.
    coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
    print(coeff_df)
    y_pred = regressor.predict(X_test)
    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
    print(df)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print("the large training just finished")
    return None
Code that fits the vectorizer
See Applying Tfidfvectorizer on list of pos tags gives ValueError for the format of my 'text' column.
from ast import literal_eval
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# identity_tokenizer is defined elsewhere (see the linked question).
def ready_for_training(dataset):
    # Copy the slice so the column assignments below do not act on a view.
    dataset = dataset.head(1000).copy()
    dataset['text'] = dataset.text.apply(lambda x: literal_eval(x))
    # Flatten each row's list of lists into a single list of tokens.
    dataset['text'] = dataset['text'].apply(
        lambda row: [item for sublist in row for item in sublist])
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
    # The vocabulary is learned here, so it depends on the rows passed in.
    keyword_response = tfidf.fit_transform(dataset['text'])
    # Use get_feature_names() instead on scikit-learn versions before 1.0.
    keyword_matrix = pd.DataFrame(keyword_response.toarray(), columns=tfidf.get_feature_names_out())
    keyword_matrix = keyword_matrix.loc[:, (keyword_matrix != 0).any(axis=0)]
    dataset['sentiments'] = dataset['sentiments'].map(eval)
    dataset = pd.concat([dataset.drop(['sentiments'], axis=1), dataset['sentiments'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['neg', 'neu', 'pos'], axis=1)
    dataset['emotions'] = dataset['emotions'].map(eval)
    dataset = pd.concat([dataset.drop(['emotions'], axis=1), dataset['emotions'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['id', 'month', 'text'], axis=1)
    result = pd.concat([dataset, keyword_matrix], axis=1, sort=False)
    return result
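What I need is to predict 'favourite_count' for a single new tweet. When I extract the keywords for that tweet, I get only a few of them, whereas the model was trained on more than 1000 keywords. I have stored the trained model in a .pkl file.
How should I handle this attribute mismatch? To fill in the columns missing from the test tweet, as in Keep same dummy variable in training and testing data, I would probably need the training set as a DataFrame. But I have only stored the trained model as a .pkl, so I will not have access to the columns it was trained on.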