How do I save an IDFModel with PySpark?

Date: 2015-08-31 14:02:42

Tags: apache-spark pyspark apache-spark-mllib

I generated an IDFModel with PySpark and an IPython notebook as follows:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

hashingTF = HashingTF()   #this will be used with hashing later

txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory

split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want

tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set

tf_train.cache()

idf_train = IDF().fit(tf_train)    #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!

tfidf_train = idf_train.transform(tf_train)

This is based on this guide: https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I want to save this model so I can load it later in another notebook. However, I couldn't find any information on how to do this; the closest I found was:

Save Apache Spark mllib model in python

But when I tried the suggestion from that answer,

idf_train.save(sc, "/home/ubuntu/newfolder")

I got this error:

AttributeError: 'IDFModel' object has no attribute 'save'

Is there something I'm missing, or is this simply not possible with IDFModel objects? Thanks!
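One workaround, assuming your Spark version exposes the model's weights (e.g. `idf_train.idf()` in later releases; this accessor is not in all 1.x versions), is to persist just the IDF weight vector and re-apply it yourself, since TF-IDF is simply the elementwise product of the TF vector and the IDF weights. A minimal pure-Python sketch of that idea, without Spark (the weight values below are made up for illustration):

```python
import pickle

def save_idf_weights(weights, path):
    """Persist the IDF weight vector (a plain list of floats) with pickle."""
    with open(path, "wb") as f:
        pickle.dump(weights, f)

def load_idf_weights(path):
    """Load a previously pickled IDF weight vector."""
    with open(path, "rb") as f:
        return pickle.load(f)

def apply_idf(tf_vector, idf_weights):
    """TF-IDF is the elementwise product of term frequencies and IDF weights."""
    return [tf * w for tf, w in zip(tf_vector, idf_weights)]

# hypothetical weights, standing in for something like idf_train.idf().toArray()
weights = [0.0, 0.5, 2.0]
save_idf_weights(weights, "/tmp/idf_weights.pkl")
restored = load_idf_weights("/tmp/idf_weights.pkl")
print(apply_idf([2.0, 4.0, 3.0], restored))  # → [0.0, 2.0, 6.0]
```

This sidesteps the unpicklable JVM-backed model object entirely: only a plain Python list crosses notebook sessions.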

1 Answer:

Answer 0 (score: 1)

I did something similar in Scala/Java. It seems to work, but it may not be very efficient. The idea is to write the model out as a serialized object and read it back later. Good luck! :)

import java.io.{FileNotFoundException, FileOutputStream, IOException, ObjectOutputStream}

try {
  val fileOut: FileOutputStream = new FileOutputStream(savePath + "/idf.jserialized")
  val out: ObjectOutputStream = new ObjectOutputStream(fileOut)
  out.writeObject(idf)
  out.close()
  fileOut.close()
  System.out.println("\nSerialization successful... check your specified output file.\n")
} catch {
  case foe: FileNotFoundException => foe.printStackTrace()
  case ioe: IOException => ioe.printStackTrace()
}
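Note that this Java-serialization trick does not carry over to PySpark directly: the Python `IDFModel` wraps a JVM object through Py4J and cannot be pickled as-is. The same serialize-then-deserialize pattern does work in Python for any plain picklable value you extract from the model (such as its weight vector). A hypothetical sketch mirroring the Scala code above with `pickle`:

```python
import pickle

def serialize_object(obj, path):
    """Mirror of the Scala ObjectOutputStream pattern, using pickle."""
    try:
        with open(path, "wb") as file_out:
            pickle.dump(obj, file_out)
        print("Serialization successful... check your specified output file.")
    except OSError as err:  # covers FileNotFoundError and other I/O errors
        print(err)

def deserialize_object(path):
    """Mirror of ObjectInputStream.readObject, using pickle."""
    with open(path, "rb") as file_in:
        return pickle.load(file_in)

# works for plain picklable objects (e.g. extracted model weights),
# but NOT for the JVM-backed IDFModel object itself
data = {"idf": [0.0, 0.5, 2.0]}
serialize_object(data, "/tmp/idf.pickled")
print(deserialize_object("/tmp/idf.pickled"))  # → {'idf': [0.0, 0.5, 2.0]}
```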