Question

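When I build a Spark model and call it, predictions take tens of milliseconds to return. However, when I save that same model and then load it, predictions take much longer. Is there some kind of cache I should be using?

Calling model.cache() after loading does not work, because the model is not an RDD.

This works well: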

from pyspark.mllib.recommendation import ALS
from pyspark import SparkContext
import time

sc = SparkContext()

# Some example data
r = [(1, 1, 1.0),
    (1, 2, 2.0),
    (2, 1, 2.0)]
ratings = sc.parallelize(r)
model = ALS.trainImplicit(ratings, 1, seed=10)

# Call model and time it
now = time.time()
for t in range(10):
    model.predict(2, 2)

elapsed = (time.time() - now)*1000/(t+1)

print("Average time for model call: {:.2f}ms".format(elapsed))

model.save(sc, 'my_spark_model')

Output: Average time for model call: 71.18ms

Whereas if I run the following, predictions take more time:

from pyspark.mllib.recommendation import MatrixFactorizationModel
from pyspark import SparkContext
import time

sc = SparkContext()

model_path = "my_spark_model"
model = MatrixFactorizationModel.load(sc, model_path)

# Call model and time it
now = time.time()
for t in range(10):
    model.predict(2, 2)

elapsed = (time.time() - now)*1000/(t+1)

print("Average time for loaded model call: {:.2f}ms".format(elapsed))

Output: Average time for loaded model call: 180.34ms

For BIG models, I am seeing prediction times of over 10 seconds for a single call after loading a saved model.

Answer 1

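In short: no, there does not seem to be anything that caches the whole model, since it is not an RDD.

You can try to use cache(), but you cannot cache the model itself, because it is not an RDD, so try this: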

model.productFeatures().cache()
model.userFeatures().cache()

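Once you no longer need them, it is recommended to unpersist() them, especially if you handle very large data, since you will want to protect your job from out-of-memory errors.

Of course, you can use persist() instead of cache(); you may want to read: What is the difference between cache and persist?

Keep in mind that Spark evaluates transformations lazily, so nothing actually happens when you load the model. An action is needed to trigger the real work: when you first use the model, Spark tries to load it at that point, so you see extra latency compared with having it already in memory.

Also note: cache() is lazy, so you can call RDD.count() explicitly to force the data into memory.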
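The cache, count, and unpersist steps described above can be sketched as one small helper. This is only an illustration under assumptions: cached_features is a hypothetical name, not part of Spark, and it relies only on userFeatures(), productFeatures(), cache(), count(), and unpersist(). It keeps a reference to each returned RDD because, in PySpark, these accessors may hand back a new RDD object on every call, so caching one call's result and counting another's might not touch the same object. Whether this speeds up model.predict() is not guaranteed; the timings reported in this answer show no gain.

```python
from contextlib import contextmanager


@contextmanager
def cached_features(model):
    """Cache a model's factor RDDs for the duration of a `with` block.

    Hypothetical helper: `model` is assumed to expose userFeatures() and
    productFeatures(), each returning an RDD-like object.
    """
    # Hold on to the returned objects so that cache(), count(), and
    # unpersist() all act on the same RDD.
    user = model.userFeatures().cache()
    product = model.productFeatures().cache()
    try:
        # cache() is lazy; count() is an action that forces materialization.
        user.count()
        product.count()
        yield user, product
    finally:
        # Release the cached blocks once they are no longer needed.
        user.unpersist()
        product.unpersist()
```

A possible use would be `with cached_features(model): model.predict(2, 2)`, with the timing loop placed inside the block.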

Output of my experiments:

Average time for model call: 1518.83ms
Average time for loaded model call: 2352.70ms
Average time for loaded model call with my suggestions: 8886.61ms

By the way, after loading the model you should have seen warnings like these:

16/08/24 00:14:05 WARN MatrixFactorizationModel: User factor does not have a partitioner. Prediction on individual records could be slow.
16/08/24 00:14:05 WARN MatrixFactorizationModel: User factor is not cached. Prediction could be slow.

But what if I do the count trick? I get no gain at all; in fact it is slower:

...
model.productFeatures().cache()
model.productFeatures().count()
model.userFeatures().cache()
model.userFeatures().count()
...

Output:

Average time for loaded model call: 13571.14ms

Without cache(), keeping only count(), I got:

Average time for loaded model call: 9312.01ms

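Important note: the timings were taken on a real-world cluster whose nodes are assigned to important jobs, so my toy example may have been preempted during the experiments. Communication cost may also dominate.

So, if I were you, I would run the experiments myself too.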
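For anyone repeating the measurements, it may help to time the first call separately from the rest, since lazy loading can make the first prediction much slower than later ones. The following is a minimal sketch, not the method used for the numbers above; time_calls is a hypothetical helper, and fn stands in for something like lambda: model.predict(2, 2).

```python
import time


def time_calls(fn, n=10):
    """Return (first_call_ms, mean_ms_of_remaining_calls) over n calls of fn."""
    if n < 2:
        raise ValueError("n must be at least 2")

    # The first call may include one-time costs such as lazy loading.
    start = time.perf_counter()
    fn()
    first_ms = (time.perf_counter() - start) * 1000.0

    # The remaining calls approximate steady-state latency.
    start = time.perf_counter()
    for _ in range(n - 1):
        fn()
    rest_ms = (time.perf_counter() - start) * 1000.0 / (n - 1)

    return first_ms, rest_ms
```

Calling it as `first_ms, rest_ms = time_calls(lambda: model.predict(2, 2))` would show whether the slowdown is concentrated in the first prediction or persists across all of them.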

In conclusion, Spark does not seem to offer any mechanism for caching your model other than the one above.

How to load a Spark model for efficient predictions
