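I am working with NBA shot data and trying to build shot prediction models using different regression techniques. However, when I fit a logistic regression model I get the following warning: Warning message: glm.fit: algorithm did not converge. In addition, the predictions do not seem to work at all (they are unchanged from the original Y variable, made or missed). My code is below. I got the data from here: Shot Data.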
nba_shots <- read.csv("shot_logs.csv")
library(dplyr)
library(ggplot2)
library(data.table)
library("caTools")
library(glmnet)
library(caret)
nba_shots_clean <- data.frame(
  "game_id" = nba_shots$GAME_ID,
  "location" = nba_shots$LOCATION,
  "shot_number" = nba_shots$SHOT_NUMBER,
  "closest_defender" = nba_shots$CLOSEST_DEFENDER,
  "defender_distance" = nba_shots$CLOSE_DEF_DIST,
  "points" = nba_shots$PTS,
  "player_name" = nba_shots$player_name,
  "dribbles" = nba_shots$DRIBBLES,
  "shot_clock" = nba_shots$SHOT_CLOCK,
  "quarter" = nba_shots$PERIOD,
  "touch_time" = nba_shots$TOUCH_TIME,
  "game_result" = nba_shots$W,
  "FGM" = nba_shots$FGM
)
mean(nba_shots_clean$shot_clock) # NA
# this gave NA return which means that there are NAs in this column that we
# need to clean up
# if the shot clock was NA I assume that this means it was the end of a
# quarter and the shot clock was off.
# For now I'm going to just set all of these NAs equal to zero, so all zeros
# mean it is the end of a quarter
# checking the amount of NAs
last_shots <- nba_shots_clean[is.na(nba_shots_clean$shot_clock),]
nrow(last_shots) # this tells me there are 5567 shots taken when the shot
# clock was turned off at the end of a quarter
# setting these NAs equal to zero (this replaces every NA in the data frame,
# not only those in shot_clock)
nba_shots_clean[is.na(nba_shots_clean)] <- 0
# checking to see if it worked
nrow(nba_shots_clean[is.na(nba_shots_clean$shot_clock),]) # it worked
# create a test and train set
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance +
points + dribbles + shot_clock + quarter + touch_time, data=nbaTrain,
family="binomial", na.action = na.omit)
nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)
This gives me the following output, which tells me the prediction did nothing, because it is identical to the original values.
    FALSE  TRUE
  0 21428     0
  1     0 17977
I would really appreciate any guidance.
Answer 0 (score: 2)
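The confusion matrix of your model (model predictions vs. nbaTest$FGM) tells you that the model is 100% accurate! This is because the points variable in your dataset is perfectly associated with the dependent variable: a shot scores 0 points exactly when FGM is 0, and 2 or 3 points exactly when FGM is 1.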
table(nba_shots_clean$points, nba_shots_clean$FGM)
        0     1
  0 87278     0
  2     0 58692
  3     0 15133
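To see why a predictor like this breaks the fit, here is a minimal sketch on simulated data. The `toy` data frame and the `determines_outcome` helper are made up for illustration and are not part of the original data or of any package. When every value of a predictor occurs with only one outcome, the classes are perfectly separated, no finite maximum likelihood estimate exists, and glm keeps pushing the coefficients outward, which is what typically produces warnings such as the one in the question.

```r
# Simulated stand-in for the shot data (hypothetical values, not the real file)
set.seed(42)
n    <- 500
made <- rbinom(n, 1, 0.45)                                    # 1 = made, 0 = missed
pts  <- ifelse(made == 1, sample(c(2, 3), n, replace = TRUE), 0)
dist <- round(runif(n, 0, 10), 1)                             # unrelated noise predictor
toy  <- data.frame(made, pts, dist)

# Hypothetical helper: TRUE when each distinct value of x occurs with only
# one outcome value. Only meaningful for predictors with few distinct values.
determines_outcome <- function(x, y) {
  tab <- table(x, y)
  all(rowSums(tab > 0) == 1)
}

determines_outcome(toy$pts, toy$made)    # pts encodes the outcome
determines_outcome(toy$dist, toy$made)   # dist does not

# Fit with the leaking predictor and collect whatever warnings glm raises
warns <- character(0)
fit <- withCallingHandlers(
  glm(made ~ pts + dist, data = toy, family = binomial),
  warning = function(w) {
    warns <<- c(warns, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(warns)           # warnings collected during the fit, if any
print(fit$converged)   # glm's own convergence flag
print(coef(fit))       # inspect the size of the pts coefficient
```

A cross-tabulation like this is a cheap check to run on any low-cardinality predictor before it goes into the model formula.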
Try removing points from the model:
# create a test and train set
set.seed(1234)
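# NOTE: sample.split() expects the label vector (e.g. nba_shots_clean$FGM);
# given a data frame it returns one flag per column, which subset() recycles over rows
split = sample.split(nba_shots_clean, SplitRatio=0.75)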
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance +
dribbles + shot_clock + quarter + touch_time, data=nbaTrain,
family="binomial", na.action = na.omit)
summary(nbaLogitModel)
Now there is no warning message, and the estimated model is:
Call:
glm(formula = FGM ~ location + shot_number + defender_distance +
    dribbles + shot_clock + quarter + touch_time, family = "binomial",
    data = nbaTrain, na.action = na.omit)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.8995  -1.1072  -0.9743   1.2284   1.6799

Coefficients:
                   Estimate Std. Error z value       Pr(>|z|)
(Intercept)       -0.427688   0.025446 -16.808        < 2e-16 ***
locationH          0.037920   0.012091   3.136        0.00171 **
shot_number        0.007972   0.001722   4.630 0.000003656291 ***
defender_distance -0.006990   0.002242  -3.117        0.00182 **
dribbles           0.010582   0.004859   2.178        0.02941 *
shot_clock         0.032759   0.001083  30.244        < 2e-16 ***
quarter           -0.043100   0.007045  -6.118 0.000000000946 ***
touch_time        -0.038006   0.005700  -6.668 0.000000000026 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 153850  on 111532  degrees of freedom
Residual deviance: 152529  on 111525  degrees of freedom
AIC: 152545

Number of Fisher Scoring iterations: 4
The confusion matrix is:
nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)
    FALSE  TRUE
  0 21554  5335
  1 16726  5955
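As a rough way to read this matrix, the sketch below recomputes the overall accuracy from the four counts above and compares it with a naive baseline that always predicts the more common class. The variable names (`accuracy`, `baseline`) are chosen here for illustration only.

```r
# Counts copied from the confusion matrix above
# (rows = actual FGM, columns = predicted probability > 0.5)
cm <- matrix(c(21554, 16726, 5335, 5955), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

accuracy <- sum(diag(cm)) / sum(cm)      # share of correct predictions
baseline <- max(rowSums(cm)) / sum(cm)   # always predict the majority class

round(c(accuracy = accuracy, baseline = baseline), 3)
# accuracy baseline
#    0.555    0.542
```

So without points the model is only slightly better than always predicting a miss, which is a far more plausible result than the 100% accuracy obtained earlier.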