I want to build a PySpark pipeline for hyperparameter tuning and model selection. Below is code I put together by following some web examples/tutorials. Since I'm new to PySpark (and Spark), I'm hoping someone could help me look over the code and, where possible, give me some tips on optimization.
Imports:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import StringIndexer, StandardScaler, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from spark_stratifier import StratifiedCrossValidator

spark = SparkSession.builder.getOrCreate()
Read the file (simply replacing '.' with '_' in the column names, since PySpark gets confused when a column name contains '.'):
credit = spark.read.options(header=True, sep=';', inferSchema=True) \
              .csv('../bank-additional/bank-additional-full.csv')
colnames = list(map(lambda x: x.replace(".", "_"), credit.columns))
credit = credit.toDF(*colnames)
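A quick scratch check (just for my own sanity) that the rename took effect and the inferred schema looks reasonable:
print([c for c in credit.columns if '.' in c])   # should print []
credit.printSchema()                             # confirm the inferred types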
Identify the numeric and categorical columns:
numerics = [x[0] for x in credit.dtypes if not x[1].startswith('string')]
categoricals = [x[0] for x in credit.dtypes if x[1].startswith('string')]
categoricals.remove('y')   # 'y' is the target, not a feature
Define the data transformations and the classifier, which can later be plugged into a pipeline. MLlib requires the input features to be assembled into a vector, which is why a VectorAssembler is used to build the data frame for the classifier:
y_indexer = StringIndexer(inputCol='y', outputCol='label')
x_indexers = [StringIndexer(inputCol=column, outputCol=column + "_indx")
              for column in list(set(categoricals))]
assembler = VectorAssembler(inputCols=numerics, outputCol="raw_features")
standardizer = StandardScaler(withMean=True, withStd=True,
                              inputCol='raw_features', outputCol='std_features')
df_builder = VectorAssembler(inputCols=['std_features'] + [l + '_indx' for l in categoricals],
                             outputCol='features')
rf = RandomForestClassifier()
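To make sure I understood the indexing/assembly step, I first tried it on a throwaway toy frame (scratch code only, not part of the pipeline):
toy = spark.createDataFrame([(1.0, 'a'), (2.0, 'b'), (3.0, 'a')], ['x', 'cat'])
toy_indexed = StringIndexer(inputCol='cat', outputCol='cat_indx').fit(toy).transform(toy)
VectorAssembler(inputCols=['x', 'cat_indx'], outputCol='features') \
    .transform(toy_indexed).show(truncate=False)   # one dense vector per row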
The data-preprocessing pipeline. I'm assuming all of the data will be reused later, hence the cache (?):
pipe_pre = Pipeline(stages=[y_indexer] + x_indexers + [assembler, standardizer, df_builder])
credit_pre = pipe_pre.fit(credit).transform(credit) \
                     .select('label', 'features').cache()
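Before tuning, I peeked at the preprocessed frame to confirm the label/features columns look right and to get a rough idea of the class balance (which is why I reached for a stratified CV in the first place):
credit_pre.printSchema()
credit_pre.show(5, truncate=False)
credit_pre.groupBy('label').count().show()   # rough class balance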
Pass the grid parameters and the analysis pipeline to the StratifiedCrossValidator (this is just to get the code working; the grid parameters and numFolds are dummy choices):
grid_params = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [3, 6]) \
    .addGrid(rf.numTrees, [5, 10]) \
    .build()
pipe_ana = Pipeline(stages=[rf])
crossval = StratifiedCrossValidator(estimator=pipe_ana,
                                    estimatorParamMaps=grid_params,
                                    evaluator=BinaryClassificationEvaluator(),
                                    numFolds=2)
model = crossval.fit(credit_pre)
results = model.transform(credit_pre)
predictionLabels = results.select("prediction", "label")
metrics = BinaryClassificationMetrics(predictionLabels.rdd)
metrics.areaUnderROC   # note: computed from the hard 0/1 predictions here, not probabilities
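Since the whole point is parallelizing the grid search, I also looked at Spark's built-in CrossValidator, which (per the docs, Spark 2.3+) takes a parallelism argument so that parameter combinations are evaluated concurrently; this is only a sketch, under the assumption that I could live without stratification:
plain_cv = CrossValidator(estimator=pipe_ana,
                          estimatorParamMaps=grid_params,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2,
                          parallelism=4)   # param combinations evaluated in parallel
plain_model = plain_cv.fit(credit_pre)
print(plain_model.avgMetrics)              # one areaUnderROC value per grid point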
Up to this point everything seems to work. However, the goal is to use PySpark to run the grid search so that the work is parallelized. With that in mind, is this a reasonable approach, and what would you suggest for optimization?
Thanks!