Question

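I am trying to save my model after fitting a pipeline in PySpark 2.4, so that I can load and use the model later. I ran into a problem while trying to save the model; a shortened version of my code is below:

Training function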

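    import sys

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.recommendation import ALS

    def main(spark, train_file, model_path):
        train = spark.read.parquet(train_file)
        indexer = StringIndexer(...) # convert the string IDs to a numerical representation
        als = ALS(...) # my ALS model, with hand-set parameters for a toy case
        pipeline = Pipeline(stages=[indexer, als])
        model = pipeline.fit(train) # fit the pipeline to my training data
        model.write().overwrite().save(model_path) # write() is a method in PySpark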

    if __name__ == "__main__":

        # Create the spark session object
        spark = SparkSession.builder.appName('an_example').getOrCreate()

        # Get the filename from the command line
        train_file = sys.argv[1]
        model_file = sys.argv[2]

        # Call our main routine
        main(spark, train_file, model_file)

Testing function

    from pyspark.ml import PipelineModel

    def main(spark, model_path, test_file):
        test = spark.read.parquet(test_file)
        model = PipelineModel.load(model_path)
        # call a method on the loaded model (this is the line that raises the error)
        userRecs = model.recommendForAllUsers(10)

    if __name__ == "__main__":
        # same as the block above, but with a new appName

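Now every time I call it, I get the following error:

AttributeError: 'PipelineModel' object has no attribute 'recommendForAllUsers'

Attempts to resolve the error

I believe the error happens because my model is now a PipelineModel object, and to get back to the underlying model object I need to change the way I save the model.

First attempt: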

    sc = SparkContext.getOrCreate()
    model.save(sc, model_file)

TypeError: save() takes 2 positional arguments but 3 were given

I believe that is because my sc has two components: <SparkContext master=yarn appName=an_example>
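
For comparison, here is a minimal sketch of saving through the pyspark.ml writer, which takes only a path. The save(sc, path) form belongs to the older RDD-based pyspark.mllib models, which would be consistent with the TypeError above (self, sc, and the path make three arguments). The helper name save_fitted_model and its overwrite flag are made up for illustration and are not part of Spark.

```python
def save_fitted_model(model, model_path, overwrite=True):
    """Hypothetical helper: persist a fitted pyspark.ml model or PipelineModel.

    pyspark.ml models are saved with a path only; no SparkContext is passed.
    """
    if overwrite:
        # write() is a method in PySpark and returns an MLWriter
        model.write().overwrite().save(model_path)
    else:
        # shortcut for model.write().save(model_path); fails if the path exists
        model.save(model_path)
```

In the training function above, this would be called as save_fitted_model(model, model_path).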

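Second attempt: combining training and testing. Surprisingly, if I call model.recommendForAllUsers right after model = pipeline.fit(train), it still gives me the same attribute error!

Now I am stuck. I have two questions:

(1) What is the correct way to save the model using the Spark context sc?

(2) I think there are several ways to load the model correctly and still have access to the model's attributes. I believe that if I solve the problem in (1), I can load it with model = als.load(sc, model_path) and the rest of my code will work. But I would like to confirm this.

I need a correct way to call my model that avoids the PipelineModel attribute error.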
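
A minimal sketch of one way to get back to the ALS model, assuming the pipeline was saved with the pyspark.ml writer and a SparkSession is already active: load the whole thing as a PipelineModel and pick the fitted ALSModel out of its stages list, since recommendForAllUsers is defined on ALSModel and not on PipelineModel. The helper names last_stage_of_type and recommend_from_saved_pipeline are hypothetical, and this is untested against the real data here.

```python
def last_stage_of_type(pipeline_model, stage_type):
    """Hypothetical helper: return the last fitted stage that is a stage_type."""
    # PipelineModel.stages is a list of fitted transformers, in pipeline order
    for stage in reversed(pipeline_model.stages):
        if isinstance(stage, stage_type):
            return stage
    raise ValueError("no stage of type %s found" % stage_type.__name__)


def recommend_from_saved_pipeline(model_path, num_items=10):
    """Hypothetical helper: load a saved PipelineModel and query its ALS stage."""
    # imported here so the helper above stays usable without Spark installed
    from pyspark.ml import PipelineModel
    from pyspark.ml.recommendation import ALSModel

    # load() takes only the path; it relies on the active SparkSession
    model = PipelineModel.load(model_path)
    als_model = last_stage_of_type(model, ALSModel)
    # recommendForAllUsers lives on the fitted ALSModel, not on PipelineModel
    return als_model.recommendForAllUsers(num_items)
```

The same stage lookup should also work on the object returned by pipeline.fit(train), because that is a PipelineModel too. If ALS was trained on the StringIndexer output column, the user IDs in the result are the indexed values, not the original strings.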

0 answers: