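I am trying to predict whether a tweet is viral using only 3 variables (for simplicity): tweet length, follower count, and friend count.
Here is my code: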
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
all_tweets = pd.read_json("random_tweets.json", lines=True)
# The cutoff value for a tweet to be considered as viral is 1000 (subjective).
all_tweets['is_viral'] = np.where(all_tweets['retweet_count'] >= 1000, 1, 0)
print(all_tweets['is_viral'].value_counts())
all_tweets['tweet_length'] = all_tweets.apply(lambda tweet: len(tweet['text']), axis=1)
all_tweets['followers_count'] = all_tweets.apply(lambda tweet: tweet['user']['followers_count'], axis=1)
all_tweets['friends_count'] = all_tweets.apply(lambda tweet: tweet['user']['friends_count'], axis=1)
labels = all_tweets['is_viral']
data = all_tweets[['tweet_length','followers_count','friends_count']]
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
train_data, test_data, train_labels, test_labels = train_test_split(scaled_data, labels, test_size = 0.2, random_state = 1)
classifier = KNeighborsClassifier(n_neighbors = 25)
classifier.fit(train_data, train_labels)
print(classifier.score(test_data, test_labels))
prediction = np.array([[140, 3313, 2272]])
scaled_pred = scaler.transform(prediction)
print(scaled_pred)
print(classifier.predict(scaled_pred))
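The components of this input (tweet length, follower count, friend count) exactly match one of the real data points in the dataset that is labeled viral (= [1]).
However, even though the accuracy score is roughly 80%, the prediction I get is [0] (not viral).
Does anyone know why the algorithm fails to classify this point correctly?
P.S. K (= 25) was chosen by trial and error, picking the value that gave the highest accuracy score.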
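One way to investigate might be to look at the 25 neighbors that actually vote on the query point. The sketch below is only an illustration: it uses a made-up synthetic dataset (not random_tweets.json) in which a single viral row sits among non-viral rows, and the names `non_viral`, `viral_point`, and `neighbor_labels` are assumptions introduced here. It suggests that with a majority vote over 25 neighbors, an exact match in the training data contributes only one vote.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 200 non-viral rows plus one viral row.
# Columns mimic (tweet_length, followers_count, friends_count).
rng = np.random.default_rng(0)
non_viral = rng.normal(loc=[100, 3000, 2000], scale=[30, 500, 400], size=(200, 3))
viral_point = np.array([[140, 3313, 2272]])
X = np.vstack([non_viral, viral_point])
y = np.array([0] * 200 + [1])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

classifier = KNeighborsClassifier(n_neighbors=25)
classifier.fit(X_scaled, y)

# Query with the exact viral row and inspect which training rows vote on it.
query = scaler.transform(viral_point)
distances, indices = classifier.kneighbors(query)
neighbor_labels = y[indices[0]]

print(neighbor_labels.sum(), "of", len(neighbor_labels), "neighbors are viral")
print(classifier.predict(query))        # majority vote of the 25 neighbors
print(classifier.predict_proba(query))  # share of votes per class
```

Running the same `kneighbors` check on the real `classifier` and `scaled_pred` would show how many of the 25 nearest tweets are labeled viral. If most of them are not, a [0] prediction follows from the vote even though one neighbor matches exactly.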