Logistic regression on NBA shot data

Time: 2017-04-26 20:09:14

Tags: r logistic-regression prediction

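I am working with NBA shot data and trying to build shot-prediction models using different regression techniques. However, when I fit a logistic regression model I get the following warning: Warning message: glm.fit: algorithm did not converge. In addition, the predictions do not seem to work at all: they are unchanged from the original Y variable (made or missed). My code is below. I got the data from here: Shot Data.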

nba_shots <- read.csv("shot_logs.csv")
library(dplyr)
library(ggplot2)
library(data.table)
library("caTools")
library(glmnet)
library(caret)

nba_shots_clean <- data.frame("game_id" = nba_shots$GAME_ID, "location" = 
nba_shots$LOCATION, "shot_number" = nba_shots$SHOT_NUMBER, 
                    "closest_defender" = nba_shots$CLOSEST_DEFENDER,
                    "defender_distance" = nba_shots$CLOSE_DEF_DIST, "points" = nba_shots$PTS, 
                    "player_name" = nba_shots$player_name, "dribbles" = nba_shots$DRIBBLES,
                    "shot_clock" = nba_shots$SHOT_CLOCK, "quarter" = nba_shots$PERIOD,
                    "touch_time" = nba_shots$TOUCH_TIME, "game_result" = nba_shots$W
                    , "FGM" = nba_shots$FGM)

mean(nba_shots_clean$shot_clock) # NA
# this gave NA return which means that there are NAs in this column that we 
# need to clean up
# if the shot clock was NA I assume that this means it was the end of a 
# quarter and the shot clock was off.
# For now I'm going to just set all of these NAs equal to zero, so all zeros 
# mean it is the end of a quarter
# checking the amount of NAs
last_shots <- nba_shots_clean[is.na(nba_shots_clean$shot_clock),]
nrow(last_shots) # this tells me there are 5567 shots taken when the shot 
# clock was turned off at the end of a quarter
# setting these NAs equal to zero
nba_shots_clean[is.na(nba_shots_clean)] <- 0
# checking to see if it worked
nrow(nba_shots_clean[is.na(nba_shots_clean$shot_clock),]) # it worked 

# create a test and train set
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance + 
points + dribbles + shot_clock + quarter + touch_time, data=nbaTrain, 
family="binomial", na.action = na.omit)

nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)

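This gives me the following output, which tells me the prediction did not do anything, because it is identical to the original values.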

  FALSE  TRUE
0 21428     0
1     0 17977

I would really appreciate any guidance.

1 answer:

Answer 0 (score: 2)

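The confusion matrix of your model (model predictions vs. nbaTest$FGM) tells you that the model has 100% accuracy! This happens because the points variable in your dataset is perfectly associated with the dependent variable: a shot scores 2 or 3 points exactly when FGM is 1. This perfect separation is also why glm.fit warns that the algorithm did not converge, since the coefficient on points can grow without bound.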

table(nba_shots_clean$points, nba_shots_clean$FGM)
        0     1
  0 87278     0
  2     0 58692
  3     0 15133
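To see the effect in isolation, here is a minimal sketch on simulated data (not the real shot_logs.csv). The names made, points, shot_clock and sim are hypothetical, and the 45% make rate is an arbitrary assumption. A predictor that encodes the outcome gives a perfect in-sample fit and typically makes glm emit a warning, while a fit without it behaves normally.

```r
set.seed(42)
n <- 5000

# Simulated outcome: 1 = made shot, 0 = missed shot (hypothetical data)
made <- rbinom(n, size = 1, prob = 0.45)
# Points are 0 for a miss and 2 or 3 for a make, so they encode the outcome
points <- ifelse(made == 1, sample(c(2, 3), n, replace = TRUE), 0)
# A predictor that is unrelated to the outcome, for comparison
shot_clock <- runif(n, min = 0, max = 24)
sim <- data.frame(made, points, shot_clock)

# Cross-tabulation: every row has a single non-zero cell
print(table(sim$points, sim$made))

# Fit with the leaking predictor, collecting any warnings glm raises
fit_warnings <- character(0)
leaky_fit <- withCallingHandlers(
  glm(made ~ points + shot_clock, data = sim, family = binomial),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(fit_warnings)

# In-sample accuracy with and without the leaking predictor
leaky_acc <- mean((fitted(leaky_fit) > 0.5) == (sim$made == 1))
clean_fit <- glm(made ~ shot_clock, data = sim, family = binomial)
clean_acc <- mean((fitted(clean_fit) > 0.5) == (sim$made == 1))
print(c(leaky = leaky_acc, clean = clean_acc))
```

A quick cross-tabulation of each candidate predictor against the outcome, as in the table above, is usually enough to spot this kind of leakage before fitting.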

Try removing points from the model:

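# create a test and train set
# note: sample.split() is documented to take a vector of labels (for example
# nba_shots_clean$FGM); given a data frame it returns one flag per column,
# which subset() then recycles across the rows
set.seed(1234)
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)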

# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance + 
dribbles + shot_clock + quarter + touch_time, data=nbaTrain, 
family="binomial", na.action = na.omit)
summary(nbaLogitModel)

There is now no warning message, and the estimated model is:

Call:
glm(formula = FGM ~ location + shot_number + defender_distance + 
    dribbles + shot_clock + quarter + touch_time, family = "binomial", 
    data = nbaTrain, na.action = na.omit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.8995  -1.1072  -0.9743   1.2284   1.6799  

Coefficients:
                   Estimate Std. Error z value       Pr(>|z|)    
(Intercept)       -0.427688   0.025446 -16.808        < 2e-16 ***
locationH          0.037920   0.012091   3.136        0.00171 ** 
shot_number        0.007972   0.001722   4.630 0.000003656291 ***
defender_distance -0.006990   0.002242  -3.117        0.00182 ** 
dribbles           0.010582   0.004859   2.178        0.02941 *  
shot_clock         0.032759   0.001083  30.244        < 2e-16 ***
quarter           -0.043100   0.007045  -6.118 0.000000000946 ***
touch_time        -0.038006   0.005700  -6.668 0.000000000026 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 153850  on 111532  degrees of freedom
Residual deviance: 152529  on 111525  degrees of freedom
AIC: 152545

Number of Fisher Scoring iterations: 4
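The estimates above are on the log-odds scale. As a minimal sketch, the values below are copied by hand from the summary and exponentiated to give odds ratios; the vector name coefs is only for illustration. With the fitted object at hand, exp(coef(nbaLogitModel)) should give the same numbers.

```r
# Coefficient estimates copied from the summary output (log-odds scale)
coefs <- c(
  locationH         =  0.037920,
  shot_number       =  0.007972,
  defender_distance = -0.006990,
  dribbles          =  0.010582,
  shot_clock        =  0.032759,
  quarter           = -0.043100,
  touch_time        = -0.038006
)

# Exponentiating a logit coefficient gives the multiplicative change in the
# odds of a made shot for a one-unit increase in that predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

For example, the shot_clock odds ratio is about 1.033, so under this model each extra second on the shot clock raises the odds of a make by roughly 3%, holding the other predictors fixed.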

The confusion matrix is:

nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)

  FALSE  TRUE
0 21554  5335
1 16726  5955
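To put this matrix in context, here is a small sketch that rebuilds it from the counts printed above and computes accuracy alongside the majority-class baseline (always predicting a miss). The object names are illustrative only.

```r
# Confusion matrix rebuilt from the printed counts
# rows = actual FGM, columns = predicted probability > 0.5
cm <- matrix(
  c(21554, 16726, 5335, 5955),
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / all shots
baseline    <- max(rowSums(cm)) / sum(cm)       # always predict the majority class
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ]) # made shots predicted as made
specificity <- cm["0", "FALSE"] / sum(cm["0", ])# missed shots predicted as missed

print(round(c(accuracy = accuracy, baseline = baseline,
              sensitivity = sensitivity, specificity = specificity), 3))
```

Accuracy comes out at about 55.5% against a baseline of about 54.2%, so without points the model is only slightly better than always predicting a miss. That is a far more plausible result than the 100% seen earlier.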