Hierarchical Bayesian linear regression with PyMC3 is very slow

Time: 2019-06-14 17:51:31

Tags: bayesian python mcmc hierarchical-bayesian pymc

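I am trying to write some code that implements a hierarchical Bayesian model (HBM) for logistic regression on the Adult dataset from the UCI repository.

I have written the code, but sampling is very slow: even with only 64 dimensions (features), it takes about 107 seconds per sample. Am I doing something wrong?

I am attaching the code for reference. Following a suggestion that it might speed things up, I also rescaled the data, but that did not help.

Thanks for your feedback.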

The code is a mix of the code from here and here.

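import numpy as np
import pandas as pd
import pymc3 as pm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

# Reload the dataset, this time without one-hot encoding the country,
# which is kept as the grouping variable for the hierarchical model
adult_df = pd.read_csv('adult.data', header=None, sep=', ', engine='python')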

adult_df.columns = ["Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"]


adult_df["Income"] = adult_df["Income"].map({ "<=50K": 0, ">50K": 1 })

adult_df.drop("CapitalGain", axis=1, inplace=True,)
adult_df.drop("CapitalLoss", axis=1, inplace=True,)

adult_df.Age = adult_df.Age.astype(float)
adult_df.fnlwgt = adult_df.fnlwgt.astype(float)
adult_df.EducationNum = adult_df.EducationNum.astype(float)
adult_df.HoursPerWeek = adult_df.HoursPerWeek.astype(float)


# NativeCountry is deliberately left out of the one-hot encoding here
adult_df = pd.get_dummies(adult_df, columns=[
    "WorkClass", "Education", "MaritalStatus", "Occupation", "Relationship",
    "Race", "Gender",
])

standard_scaler_cols = ["Age", "fnlwgt", "EducationNum", "HoursPerWeek",]
other_cols = list(set(adult_df.columns) - set(standard_scaler_cols))
mapper = DataFrameMapper(
    [([col,], StandardScaler(),) for col in standard_scaler_cols] +
    [(col, None,) for col in other_cols]
)



le = preprocessing.LabelEncoder()
country_idx = le.fit_transform(adult_df['NativeCountry'])

# Extract the labels first, then inspect the class balance
y_all = adult_df["Income"].values
pd.Series(y_all).value_counts()
adult_df.drop("Income", axis=1, inplace=True)

adult_df.drop("NativeCountry", axis=1, inplace=True,)
n_countries = len(set(country_idx))
n_features = len(adult_df.columns) 

min_max_scaler = preprocessing.MinMaxScaler()

adult_df = min_max_scaler.fit_transform(adult_df)

rs = 0  # placeholder seed; the value of `rs` is not shown in the original post
X_train, X_test, y_train, y_test, country_idx_train, country_idx_test = train_test_split(
    adult_df, y_all, country_idx,
    train_size=0.1, test_size=0.25, stratify=y_all, random_state=rs,
)

with pm.Model() as multilevel_model:

    # Hyperpriors for the slopes
    mu_theta = pm.MvNormal(name='mu_theta', mu=np.zeros(n_features), cov=np.eye(n_features), shape=n_features)

    packed_L_theta = pm.LKJCholeskyCov('packed_L', n=n_features,
                                       eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L_theta = pm.expand_packed_triangular(n_features, packed_L_theta)
    # Per-country slopes: one row of n_features coefficients per country
    theta = pm.MvNormal(name='theta', mu=mu_theta, chol=L_theta, shape=[n_countries, n_features])


    # Hyperpriors for the intercept
    mu_b = pm.StudentT('mu_b', nu=3, mu=0., sd=1.0)
    sigma_b = pm.HalfNormal('sigma_b', sd=1.0)

    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=[n_countries, 1])
    # Calculate predictions given values
    # for intercept and slope 
    yhat = pm.invlogit(b[country_idx_train] +  pm.math.dot(theta[country_idx_train], np.asarray(X_train).T))

    #Make predictions fit reality

    y = pm.Binomial('y', n=np.ones(y_train.shape[0]), p=yhat, observed=y_train)
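
A shape check on the linear predictor above may be worth doing before tuning anything else. The sketch below uses NumPy only, with small made-up sizes (`n_obs`, `n_features`, `n_countries` and the random data are hypothetical stand-ins for the real arrays). It shows that `dot(theta[idx], X.T)` produces an `n_obs x n_obs` matrix, while a row-wise product summed over the features gives one value per observation. Whether this accounts for the slowness in the actual model is an assumption, not something confirmed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features, n_countries = 5, 3, 2  # hypothetical toy sizes

X = rng.normal(size=(n_obs, n_features))            # stand-in for X_train
theta = rng.normal(size=(n_countries, n_features))  # per-country slopes
b = rng.normal(size=(n_countries, 1))               # per-country intercepts
idx = rng.integers(0, n_countries, size=n_obs)      # stand-in for country_idx_train

# Same pattern as in the model: (n_obs, n_features) dot (n_features, n_obs)
full = b[idx] + np.dot(theta[idx], X.T)

# Row-wise alternative: each observation is paired only with its own country's slopes
rowwise = b[idx, 0] + (theta[idx] * X).sum(axis=1)

print(full.shape)     # one entry per pair of observations
print(rowwise.shape)  # one entry per observation

# The per-observation values are the diagonal of the full matrix
assert np.allclose(np.diag(full), rowwise)
```

With the real training split, the first form would hold one entry for every pair of observations, so the likelihood would be evaluated on a far larger array than the number of labels.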

1 Answer:

Answer 0 (score: 1)

For PyMC3 questions, you may have more success on our Discourse: https://discourse.pymc.io/ I invite you to move your question there.

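The first thing I would check is whether Theano is compiled against the MKL library, or whether it is even falling back to Python mode. If you installed via conda, you should get MKL; if you used pip, it may be harder. http://deeplearning.net/software/theano/troubleshooting.html#test-blas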
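
A rough way to run that check from Python is sketched below. It times a NumPy matrix product as a crude BLAS sanity check and, if Theano can be imported, reports a few of its configuration values (`blas.ldflags`, `cxx`, `mode`). The helper names `time_matmul` and `theano_blas_settings` and the matrix size are illustrative choices, not part of any library, and the timing only hints at BLAS performance rather than proving which library is linked.

```python
import time
import numpy as np


def time_matmul(n=500, repeats=3):
    """Return the best wall-clock time (seconds) of an n x n float64 matmul."""
    rng = np.random.default_rng(0)
    a = rng.normal(size=(n, n))
    b = rng.normal(size=(n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    return best


def theano_blas_settings():
    """Return Theano's BLAS-related settings, or None if Theano cannot be imported."""
    try:
        import theano
    except Exception:  # Theano missing or failing to import in this environment
        return None
    return {
        "blas.ldflags": theano.config.blas.ldflags,  # flags used to link against BLAS
        "cxx": theano.config.cxx,                    # C++ compiler; empty means no C compilation
        "mode": theano.config.mode,                  # default compilation mode
    }


elapsed = time_matmul()
settings = theano_blas_settings()
print(f"500x500 matmul, best of 3: {elapsed:.4f} s")
print("Theano settings:", settings)
```

An empty `blas.ldflags` or an empty `cxx` would suggest that Theano is not linking against an optimized BLAS or is not compiling C code at all, which is the situation the troubleshooting page linked above describes.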