KNN classifier gives an inaccurate prediction

Time: 2020-07-15 21:51:03

Tags: python machine-learning scikit-learn classification knn

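I am trying to predict whether a tweet is viral by considering only 3 variables (for simplicity):

  • Tweet length
  • Number of followers of the Twitter account
  • Number of friends of the Twitter account

Here is my code: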

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

all_tweets = pd.read_json("random_tweets.json", lines=True)

# The cutoff value for a tweet to be considered as viral is 1000 (subjective).
all_tweets['is_viral'] = np.where(all_tweets['retweet_count'] >= 1000, 1, 0)
print(all_tweets['is_viral'].value_counts())

all_tweets['tweet_length'] = all_tweets.apply(lambda tweet: len(tweet['text']), axis=1)
all_tweets['followers_count'] = all_tweets.apply(lambda tweet: tweet['user']['followers_count'], axis=1)
all_tweets['friends_count'] = all_tweets.apply(lambda tweet: tweet['user']['friends_count'], axis=1)

labels = all_tweets['is_viral']
data = all_tweets[['tweet_length','followers_count','friends_count']]

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

train_data, test_data, train_labels, test_labels = train_test_split(scaled_data, labels, test_size = 0.2, random_state = 1)

classifier = KNeighborsClassifier(n_neighbors = 25)
classifier.fit(train_data, train_labels)
print(classifier.score(test_data, test_labels))

prediction = np.array([[140, 3313, 2272]])
scaled_pred = scaler.transform(prediction)
print(scaled_pred)
print(classifier.predict(scaled_pred))

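The components of this input (tweet length, follower count, friend count) exactly match one of the real data points in the dataset that is labeled as viral (= [1]).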

However, even though the classifier's accuracy is about 80%, the prediction I get is [0] (not viral).
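An accuracy of about 80% is hard to interpret without knowing the class balance printed by value_counts(). As a rough sketch, assuming (hypothetically) that about 80% of tweets are non-viral, a baseline that always predicts "not viral" reaches the same score. The 80/20 split and the placeholder features below are made up for illustration and are not taken from random_tweets.json:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 80 non-viral tweets and 20 viral ones (assumed ratio).
y = np.array([0] * 80 + [1] * 20)
# Placeholder features; DummyClassifier ignores them.
X = np.zeros((100, 3))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_score = baseline.score(X, y)
print(baseline_score)  # 0.8, without learning anything from the features

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y, baseline.predict(X))
print(cm)  # every viral tweet ends up in the "predicted 0" column
```

If the real KNN score is close to such a baseline, the model may mostly be predicting the majority class, which would be consistent with a viral tweet being labeled [0].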

Does anyone know why the algorithm fails to classify it correctly?
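One way to investigate is to look at the 25 neighbors that actually vote. With uniform weights, a query that coincides with a viral training point still gets only one vote from that point, so it can be outvoted by nearby non-viral tweets. A minimal sketch with synthetic data standing in for the scaled tweet features (the data and the names non_viral and viral_point are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled features (not the real tweet data).
rng = np.random.default_rng(0)
non_viral = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
# A single viral tweet located in the middle of the non-viral ones.
viral_point = np.array([[0.1, 0.2, -0.1]])

X = np.vstack([non_viral, viral_point])
y = np.array([0] * 200 + [1])

clf = KNeighborsClassifier(n_neighbors=25)
clf.fit(X, y)

# Query with the exact coordinates of the viral training point.
distances, indices = clf.kneighbors(viral_point)
neighbor_labels = y[indices[0]]

print(distances[0][0])        # approximately 0: the nearest neighbor is the point itself
print(neighbor_labels.sum())  # 1: only one of the 25 voters is viral
proba = clf.predict_proba(viral_point)
print(proba)                  # class-1 share is 1/25 = 0.04
prediction = clf.predict(viral_point)
print(prediction)             # [0]: the single viral neighbor is outvoted
```

Running the same kneighbors and predict_proba calls on the real classifier and scaled_pred would show how many of the 25 neighbors are viral. Passing weights='distance' to KNeighborsClassifier makes closer neighbors count more, which may change the outcome; whether that helps on the real data is something to test rather than assume.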

P.S. K (= 25) was chosen by trial and error, as the value that gave the highest accuracy score.
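The trial-and-error search for K can be made systematic with cross-validation. The sketch below is an illustration only: it uses make_classification to generate imbalanced synthetic data (an assumption, not the tweet dataset), scores each candidate with balanced accuracy instead of plain accuracy, and puts the scaler inside a pipeline so it is fitted on the training folds only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-feature data (roughly 80% class 0, 20% class 1).
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=1,
)

candidate_ks = range(1, 52, 5)  # 1, 6, 11, ..., 51
mean_scores = {}
for k in candidate_ks:
    # Scaling happens inside the pipeline, so each fold is scaled independently.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

Balanced accuracy averages the recall of both classes, so a K that simply predicts "not viral" everywhere no longer looks like the best choice.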

0 answers:

No answers