How to run a simple grid search with Apache Spark

Time: 2017-07-24 00:02:41

Tags: python apache-spark machine-learning scikit-learn grid-search

I'm trying to use Scikit-Learn's GridSearch class to tune the hyperparameters of a logistic regression algorithm.

However, GridSearch takes days to run unless you're tuning only a single parameter, even when using multiple jobs in parallel. I've thought about using Apache Spark to speed up the process, but I have two questions.

  • To use Apache Spark, do you really need multiple machines to distribute the workload? For example, if you only have one laptop, is it pointless to use Apache Spark?

  • Is there a simple way to use Scikit-Learn's GridSearch in Apache Spark?

I have read the documentation, but it talks about running parallel workers over an entire machine learning pipeline, whereas I only want it for the parameter tuning.

Imports

import datetime
%matplotlib inline

import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab

import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB

# Initialize Spark; findspark locates the local Spark installation.
# On a single machine you can pass e.g. master='local[*]' to use all cores.
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()

from datetime import datetime as dt
import scipy
import itertools

# Load the cleaned Airbnb dataset
ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')

pylab.rcParams['figure.figsize'] = (15, 10)
plt.style.use("fivethirtyeight")

new_style = {'grid': False}
plt.rc('axes', **new_style)

Algorithm hyperparameter tuning

X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)

knn = KNeighborsClassifier()

parameters = {'leaf_size': range(1, 100), 'n_neighbors': range(1, 10), 'weights': ['uniform', 'distance'], 
              'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}


# ======== What I want to do in Apache Spark ========= #

%%time
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_

# ==================================================== #

1 Answer:

Answer 0 (score: 1)

You can use a library called spark-sklearn to run a distributed parameter sweep. You're right that you would need either a cluster of machines, or a single machine with multiple CPUs, to get a parallel speedup.
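A minimal sketch of what that could look like, assuming the spark-sklearn package is installed and reusing the SparkContext sc, the knn estimator, and the training data defined above (spark-sklearn's GridSearchCV takes a SparkContext as its first argument and otherwise mirrors the scikit-learn API):

# Sketch only: assumes `pip install spark-sklearn` and the objects
# sc, knn, X_train, y_train defined earlier in the question.
from spark_sklearn import GridSearchCV as SparkGridSearchCV

parameters = {'n_neighbors': range(1, 100)}

# Same estimator and grid as before; only the search object changes.
# Each parameter combination is evaluated as a Spark task, so even a
# single multi-core machine in local mode parallelizes the sweep.
clf = SparkGridSearchCV(sc, estimator=knn, param_grid=parameters)
clf.fit(X_train, y_train)
best = clf.best_estimator_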

Hope this helps,

Roope - Microsoft MMLSpark Team