PyMC3: Posterior of the Rugby model?

Date: 2017-01-17 22:12:48

Tags: python bayesian pymc mcmc pymc3

I've just started reading the PyMC3 documentation (I'm much more comfortable with sklearn) and came across the Rugby hierarchical model example:

# Imports and Rugby data setup -- model in next section

import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
import matplotlib.pyplot as plt
import seaborn as sns

games = [
    ['Wales', 'Italy', 23, 15],
    ['France', 'England', 26, 24],
    ['Ireland', 'Scotland', 28, 6],
    ['Ireland', 'Wales', 26, 3],
    ['Scotland', 'England', 0, 20],
    ['France', 'Italy', 30, 10],
    ['Wales', 'France', 27, 6],
    ['Italy', 'Scotland', 20, 21],
    ['England', 'Ireland', 13, 10],
    ['Ireland', 'Italy', 46, 7],
    ['Scotland', 'France', 17, 19],
    ['England', 'Wales', 29, 18],
    ['Italy', 'England', 11, 52],
    ['Wales', 'Scotland', 51, 3],
    ['France', 'Ireland', 20, 22],
]
columns = ['home_team', 'away_team', 'home_score', 'away_score']
df = pd.DataFrame(games, columns=columns)

teams = df.home_team.unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index

df = pd.merge(df, teams, left_on='home_team', right_on='team', how='left')
df = df.rename(columns={'i': 'i_home'}).drop('team', axis=1)
df = pd.merge(df, teams, left_on='away_team', right_on='team', how='left')
df = df.rename(columns={'i': 'i_away'}).drop('team', axis=1)

observed_home_goals = df.home_score.values
observed_away_goals = df.away_score.values

home_team = df.i_home.values
away_team = df.i_away.values

num_teams = len(df.i_home.drop_duplicates())
num_games = len(home_team)

g = df.groupby('i_away')
att_starting_points = np.log(g.away_score.mean())
g = df.groupby('i_home')
def_starting_points = -np.log(g.away_score.mean())
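The integer IDs in `i_home` and `i_away` follow the order in which teams first appear as the home side, which matters later when a team has to be picked by ID. A minimal sketch of that ordering in plain Python (the `home_teams` list is copied from the first column of `games`; the `team_id` dict is a hypothetical helper, not part of the original code):

```python
# Home teams in the order they appear in `games` (first column of the data).
home_teams = ['Wales', 'France', 'Ireland', 'Ireland', 'Scotland', 'France',
              'Wales', 'Italy', 'England', 'Ireland', 'Scotland', 'England',
              'Italy', 'Wales', 'France']

# Mirror df.home_team.unique(): keep first occurrences, in order of appearance.
team_id = {}
for name in home_teams:
    if name not in team_id:
        team_id[name] = len(team_id)

print(team_id)
```

With the `teams` DataFrame built above, the same lookup could be obtained with `dict(zip(teams.team, teams.i))`, which avoids hard-coding IDs.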

Here is the main PyMC3 model setup:

with pm.Model() as model:
    # Global model parameters
    home = pm.Normal('home', 0, tau=.0001)
    tau_att = pm.Gamma('tau_att', .1, .1)
    tau_def = pm.Gamma('tau_def', .1, .1)
    intercept = pm.Normal('intercept', 0, tau=.0001)

    # Team-specific model parameters
    atts_star = pm.Normal('atts_star', mu=0, tau=tau_att, shape=num_teams)
    defs_star = pm.Normal('defs_star', mu=0, tau=tau_def, shape=num_teams)

    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))
    home_theta = tt.exp(intercept + home + atts[home_team] + defs[away_team])
    away_theta = tt.exp(intercept + atts[away_team] + defs[home_team])

    # Likelihood of observed data
    home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_goals)
    away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_goals)

    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(20000, step, start=start)
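To make the structure of the rates concrete, here is a small NumPy sketch of the same log-linear formula outside of PyMC3. All parameter values are made up for illustration and are not estimates from the model; only the formula mirrors the model code.

```python
import numpy as np

# Hypothetical parameter values for illustration only (not fitted estimates).
intercept = 3.0
home = 0.2
atts_star = np.array([0.5, 0.1, 0.6, 0.0, -0.3, 0.3])
defs_star = np.array([-0.2, 0.1, -0.3, 0.2, 0.4, -0.2])

# Sum-to-zero centering, as in the two Deterministic nodes of the model.
atts = atts_star - atts_star.mean()
defs = defs_star - defs_star.mean()

# A single game: team 0 at home against team 4.
home_team = np.array([0])
away_team = np.array([4])

# Expected points per side; the home side gets the extra `home` term.
home_theta = np.exp(intercept + home + atts[home_team] + defs[away_team])
away_theta = np.exp(intercept + atts[away_team] + defs[home_team])

print(home_theta, away_theta)
```

Because `home_team` and `away_team` are plain integer index arrays, any pairing of team IDs can be plugged into the same two expressions, including pairings that never occurred in the data.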

I know how to plot the trace

pm.traceplot(trace[5000:])

and generate posterior predictive samples

ppc = pm.sample_ppc(trace[5000:], samples=500, model=model)

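What I'm unsure about: how do I ask questions of the model/posterior?

For example, I assume the score distribution for the Wales vs Italy matchup would be: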

# Wales vs Italy is the first matchup in our dataset
home_wales = ppc['home_points'][:, 0]
away_italy = ppc['away_points'][:, 0]
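Once the two score arrays are extracted, questions about that game reduce to arithmetic on the samples. A minimal sketch, using a synthetic stand-in for `ppc` (the `ppc_demo` dict and its Poisson rates are invented for illustration; the real `ppc` arrays are assumed to have shape samples x games, as the indexing above implies):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for `ppc`: 500 draws x 15 games, using made-up Poisson rates.
ppc_demo = {
    'home_points': rng.poisson(25, size=(500, 15)),
    'away_points': rng.poisson(15, size=(500, 15)),
}

# Column 0 corresponds to the first game in the dataset.
home_wales = ppc_demo['home_points'][:, 0]
away_italy = ppc_demo['away_points'][:, 0]

# Fraction of simulated games the home side wins, and the average margin.
p_home_win = (home_wales > away_italy).mean()
mean_margin = (home_wales - away_italy).mean()

print(p_home_win, mean_margin)
```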

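But what about matchups that were never recorded in the original data?

  • If Italy played France at home, what would the score distributions look like?
  • If Italy played France at home, how often would both teams score under 15?

Thanks for any help/insight.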

1 Answer:

Answer 0: (score: 2)

I'm fairly sure I figured this out after reading the PyMC3 Hierarchical Partial Pooling example. Answering the questions in order:

  1. Yes, that is the distribution for the Wales vs Italy matchup (because it is the first game in the observed data).

  2. To predict a matchup that is not in the data (Italy never hosted France in the original dataset), we need to predict new thetas. Here is the code for the updated model:

    # Setup the model similarly to the previous one...
    with pm.Model() as model:
        # Global model parameters
        home = pm.Normal('home', 0, tau=.0001)
        tau_att = pm.Gamma('tau_att', .1, .1)
        tau_def = pm.Gamma('tau_def', .1, .1)
        intercept = pm.Normal('intercept', 0, tau=.0001)

        # Team-specific model parameters
        atts_star = pm.Normal('atts_star', mu=0, tau=tau_att, shape=num_teams)
        defs_star = pm.Normal('defs_star', mu=0, tau=tau_def, shape=num_teams)

        atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
        defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))
        home_theta = tt.exp(intercept + home + atts[home_team] + defs[away_team])
        away_theta = tt.exp(intercept + atts[away_team] + defs[home_team])

        # Likelihood of observed data
        home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_goals)
        away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_goals)

    # Now for predictions with no games played...
    with model:
        # IDs from `teams` DataFrame
        italy, france = 4, 1

        # New `thetas` for Italy vs France predictions
        pred_home_theta = tt.exp(intercept + home + atts[italy] + defs[france])
        pred_away_theta = tt.exp(intercept + atts[france] + defs[italy])
        pred_home_points = pm.Poisson('pred_home_points', mu=pred_home_theta)
        pred_away_points = pm.Poisson('pred_away_points', mu=pred_away_theta)

    # Sample the final model
    with model:
        start = pm.find_MAP()
        step = pm.NUTS(state=start)
        trace = pm.sample(20000, step, start=start)

  3. Once the trace is done, we can plot the predictions:

    # Use 5,000 as MCMC burn in
    pred = pd.DataFrame({
        "italy": trace["pred_home_points"][5000:],
        "france": trace["pred_away_points"][5000:],
    })

    # Plot the distributions
    sns.kdeplot(pred.italy, shade=True, label="Italy")
    sns.kdeplot(pred.france, shade=True, label="France")
    plt.show()

    [Figure: Italy vs France Rugby distributions]

    How often does Italy win at home?

    # 19% of the time
    (pred.italy > pred.france).mean()

    How often do both teams score 15 points or fewer?

    # 0.7% of the time
    1.0 * len(pred[(pred.italy <= 15) & (pred.france <= 15)]) / len(pred)
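As a possible alternative to adding prediction nodes and re-sampling, the same quantities can presumably be computed after the fact from the posterior draws of the original model. This is not what the answer above does; it is a sketch under the assumption that `trace['intercept']`, `trace['home']`, `trace['atts']` and `trace['defs']` are available as NumPy arrays. The arrays below are synthetic stand-ins, so the printed numbers say nothing about the real teams.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 15000  # e.g. number of draws kept after burn-in

# Synthetic stand-ins for trace['intercept'], trace['home'],
# trace['atts'] and trace['defs'] (values invented for illustration).
intercept = rng.normal(3.0, 0.05, size=n)
home = rng.normal(0.2, 0.05, size=n)
atts = rng.normal(0.0, 0.1, size=(n, 6))
defs = rng.normal(0.0, 0.1, size=(n, 6))

italy, france = 4, 1  # IDs from the `teams` DataFrame

# One Poisson rate per posterior draw, using the model's formula.
home_theta = np.exp(intercept + home + atts[:, italy] + defs[:, france])
away_theta = np.exp(intercept + atts[:, france] + defs[:, italy])

# One simulated score per draw for each side.
italy_pts = rng.poisson(home_theta)
france_pts = rng.poisson(away_theta)

# The same two questions as above, answered from the simulated scores.
p_italy_win = (italy_pts > france_pts).mean()
p_both_low = ((italy_pts <= 15) & (france_pts <= 15)).mean()

print(p_italy_win, p_both_low)
```

The appeal of this route is that any new matchup only needs a different pair of IDs, with no further MCMC run.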
