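Python sklearn: predicting values on an unseen dataset

Date: 2018-02-19 14:06:24

Tags: python sklearn-pandas adaboost

I have a set of football data in a database, and I am trying to predict match results from it.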

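After loading the dataset into a dataframe, running train_test_split and fitting a model, how do I predict values for an unseen dataset and return game_id together with the predicted value (FTR)?

As you can see in the code, I have a table (tmp_all_output_id) from which I select the rows with known results into 'games' and the rows with unknown (not yet played) results into 'predict_games'. For 'predict_games' I also set FTR (full-time result) = -10, because the results of those games are not known yet.

But how do I use the training I have done to predict FTR for the 'predict_games' dataframe?

I tried to predict with this code, but it always returns 0 (draw) for FTR, which is certainly not correct.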

import MySQLdb
import pandas as pd
from sklearn.feature_selection import RFE
from sqlalchemy import create_engine
import mysql.connector
from matplotlib import pyplot

mysql_cn= MySQLdb.connect(host='database.rds.amazonaws.com',port=3306,user='username', passwd='password', db='dev')
# Games with a known result (training data)
games = pd.read_sql(
    '''SELECT game_id, game_date_id, home_team_id, away_team_id, referee_id, FTR, away_team_travel
       FROM dev.tmp_all_output_id WHERE game_id < 6700;''',
    con=mysql_cn)

# Games not yet played; FTR is a -10 placeholder because the result is unknown
predict_games = pd.read_sql(
    '''SELECT game_id, game_date_id, home_team_id, away_team_id, referee_id, -10 AS FTR, away_team_travel
       FROM dev.tmp_all_output_id WHERE game_id > 6700;''',
    con=mysql_cn)

feature_names = ['game_id', 'game_date_id', 'home_team_id', 'away_team_id', 'referee_id', 'away_team_travel']
X = games[feature_names]
y = games['FTR']

# Create training and test sets (no scaling is applied here)
from sklearn.model_selection import train_test_split
validation_size = 0.20
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed)

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)

predictions = ada.predict(X_test)

print('Accuracy of AdaBoostClassifier on training set: {:.2f}'.format(ada.score(X_train, y_train)))
print('Accuracy of AdaBoostClassifier on test set: {:.2f}'.format(ada.score(X_test, y_test)))

#cnx = create_engine('mysql+mysqlconnector://username:password@database.rds.amazonaws.com:3306/dev', echo=False)
#testResults.to_sql(name='tmp_all_output_prediction', con=cnx, if_exists = 'replace', index=False)

mysql_cn.close()

I added the following code:

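# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
testResults = predict_games[['game_id']].copy()
# raw_prediction is not defined in the posted code
testResults['FTR'] = raw_prediction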

However, every predicted value comes back as -1 (away win), which is not correct.

1 Answer:

Answer 0 (score: 0)

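Your ada variable is now a trained classifier instance. To classify new data with it, you need to build the input in the same format as X, that is, a frame with the columns 'game_id', 'game_date_id', 'home_team_id', 'away_team_id', 'referee_id', 'away_team_travel'.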

Then you run ada.predict(X) on that input, and you are done!
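
A minimal sketch of that step, assuming small synthetic stand-ins for the games and predict_games frames from the question. The random values, the make_frame helper and the -1/0/1 label encoding are assumptions made for illustration; the real frames would come from the pd.read_sql calls. The point is that predict receives the same feature columns that were used for fit, and game_id is carried alongside the predictions in the result frame:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

feature_names = ['game_id', 'game_date_id', 'home_team_id',
                 'away_team_id', 'referee_id', 'away_team_travel']

rng = np.random.default_rng(0)

def make_frame(game_ids):
    """Hypothetical helper: fill the question's columns with random values."""
    n = len(game_ids)
    return pd.DataFrame({
        'game_id': game_ids,
        'game_date_id': rng.integers(1, 400, n),
        'home_team_id': rng.integers(1, 21, n),
        'away_team_id': rng.integers(1, 21, n),
        'referee_id': rng.integers(1, 31, n),
        'away_team_travel': rng.integers(0, 500, n),
    })

# Stand-in for the games with known results
games = make_frame(np.arange(1, 201))
# Assumed encoding: -1 = away win, 0 = draw, 1 = home win
games['FTR'] = rng.choice([-1, 0, 1], size=len(games))

# Stand-in for the games that have not been played yet
predict_games = make_frame(np.arange(6701, 6711))

ada = AdaBoostClassifier()
ada.fit(games[feature_names], games['FTR'])

# Pass the same feature columns that were used for training
predicted = ada.predict(predict_games[feature_names])

# Pair each game_id with its predicted FTR
testResults = predict_games[['game_id']].copy()
testResults['FTR'] = predicted
print(testResults)
```

Because the labels here are random, the predictions themselves carry no meaning; only the shape of the workflow is being shown.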

The problem is that you are currently passing only game_id.
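
One way to catch this kind of mismatch early is to compare the columns of the frame being scored against the list used for training before calling predict. The sketch below uses a hypothetical helper, select_features, which is not part of scikit-learn or of the original code:

```python
import pandas as pd

feature_names = ['game_id', 'game_date_id', 'home_team_id',
                 'away_team_id', 'referee_id', 'away_team_travel']

def select_features(frame, columns):
    """Hypothetical helper: return frame[columns], failing loudly if any column is missing."""
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return frame[columns]

# A frame that carries only game_id, as in the attempt described above
only_ids = pd.DataFrame({'game_id': [6701, 6702]})

try:
    select_features(only_ids, feature_names)
except KeyError as err:
    # Reports which training columns are absent from the frame
    print(err)
```

With a helper like this, the call would read ada.predict(select_features(predict_games, feature_names)).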