Scala - Creating an IndexedDatasetSpark object

Asked: 2016-05-06 13:01:59

Tags: mongodb scala apache-spark mahout mahout-recommender

I want to run Spark's RowSimilarity recommender on data obtained from MongoDB. To do this, I wrote the code below, which fetches the input from Mongo and converts it into an RDD of objects. This then needs to be passed to IndexedDatasetSpark and on to SimilarityAnalysis.rowSimilarityIDS:

import org.apache.hadoop.conf.Configuration
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
import org.apache.spark.rdd.{NewHadoopRDD, RDD}
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

object SparkExample extends App {
  val mongoConfig = new Configuration()
  mongoConfig.set("mongo.input.uri", "mongodb://my_mongo_ip:27017/db.collection")

  val sparkConf = new SparkConf()
  val sc = new SparkContext("local", "SparkExample", sparkConf)

  val documents: RDD[(Object, BSONObject)] = sc.newAPIHadoopRDD(
    mongoConfig,
    classOf[MongoInputFormat],
    classOf[Object],
    classOf[BSONObject]
  )
  val new_doc: RDD[(String, String)] = documents.map { doc1 =>
    (
      doc1._2.get("product_id").toString,
      doc1._2.get("product_attribute_value").toString
        .replace("[ \"", "")
        .replace("\"]", "")
        .split("\" , \"")
        .map(value => value.toLowerCase.replace(" ", "-"))
        .mkString(" ")
    )
  }
  val myIDs = IndexedDatasetSpark(new_doc)(sc)

  // readWriteSchema is defined elsewhere in the project (not shown here)
  SimilarityAnalysis.rowSimilarityIDS(myIDs).dfsWrite("hdfs://myhadoop:9000/myfile", readWriteSchema)
}

I am unable to create an IndexedDatasetSpark that can be passed to SimilarityAnalysis.rowSimilarityIDS. Please help me resolve this.

EDIT1:

I managed to create the IndexedDatasetSpark object and the code now compiles correctly. I had to pass (sc) explicitly as the implicit parameter to IndexedDatasetSpark to get past this error:

Error: could not find implicit value for parameter sc: org.apache.spark.SparkContext

Now, when I run it, it gives the following error:

Error: could not find implicit value for parameter sc: org.apache.mahout.math.drm.DistributedContext

I cannot figure out how to provide the DistributedContext.
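For reference, one way to put such a context into implicit scope is to wrap the existing SparkContext in Mahout's SparkDistributedContext. This is only a sketch, with the class and package name assumed from the mahout-spark 0.11.x bindings, so it should be checked against the actual version:

import org.apache.mahout.math.drm.DistributedContext
import org.apache.mahout.sparkbindings.SparkDistributedContext

// Wrap the SparkContext created above so that methods expecting an implicit
// DistributedContext (such as dfsWrite) can find one in scope.
implicit val dc: DistributedContext = new SparkDistributedContext(sc)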

Is this the correct way to create the RDD and convert it into an IDS so that it can be processed by rowSimilarityIDS?

More background: I started from this question: Run Mahout RowSimilarity recommender on MongoDB data

My build.sbt:

name := "scala-mongo"

version := "1.0"

scalaVersion := "2.10.6"

libraryDependencies += "org.mongodb" %% "casbah" % "3.1.1"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "1.4.2"

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "2.6.0" exclude("javax.servlet", "servlet-api") exclude ("com.sun.jmx", "jmxri") exclude ("com.sun.jdmk", "jmxtools") exclude ("javax.jms", "jms") exclude ("org.slf4j", "slf4j-log4j12") exclude("hsqldb","hsqldb"),
  "org.scalatest" % "scalatest_2.10" % "1.9.2" % "test"
)

libraryDependencies += "org.apache.mahout" % "mahout-math-scala_2.10" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-spark_2.10" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-math" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-hdfs" % "0.11.2"

resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"

resolvers += Resolver.mavenLocal

EDIT2: I temporarily removed the dfsWrite call to let the code execute, and ran into the following error:

java.io.NotSerializableException: org.apache.mahout.math.DenseVector
Serialization stack:
	- object not serializable (class: org.apache.mahout.math.DenseVector, value: {3:1.0,8:1.0,10:1.0})
	- field (class: scala.Some, name: x, type: class java.lang.Object)
	- object (class scala.Some, Some({3:1.0,8:1.0,10:1.0}))
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:240)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Is there some serialization setup that I may have skipped?
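One guess based on the stack trace: Mahout's DenseVector is not Java-serializable, and the Mahout Spark bindings are normally run with Kryo serialization plus Mahout's own Kryo registrator. Below is a sketch of setting this up by hand on the SparkConf from the snippet above; the registrator class name is an assumption taken from the mahout-spark module and should be verified against the 0.11.2 jar:

// Sketch: use Kryo so that Mahout vector classes can be shipped to executors.
// The registrator class name is an assumption and should be verified.
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
    "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
val sc = new SparkContext("local", "SparkExample", sparkConf)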

1 Answer:

Answer 0 (score: 0)

I've put back what you deleted; the secondary errors are self-inflicted.

The original error is because you haven't created the SparkContext, which can be done like this:

implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "SparkExample")

After that, I think the implicit conversion from mc (a SparkDistributedContext) to sc (a SparkContext) will be handled by the package helper functions. If sc is still reported as missing, try:

implicit val sc = sdc2sc(mc)
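
Pulling that together, here is a sketch of the full context setup; the masterUrl/appName argument names for mahoutSparkContext are assumed from the mahout-spark 0.11.x package object, so verify them against the version in build.sbt:

import org.apache.spark.SparkContext
import org.apache.mahout.sparkbindings._

// Create the Mahout distributed context instead of a plain SparkContext;
// in the 0.11.x bindings this should also wire up Kryo serialization for
// Mahout classes, addressing the DenseVector NotSerializableException above.
implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "SparkExample")

// Unwrap the underlying SparkContext where one is required explicitly,
// e.g. for newAPIHadoopRDD and for IndexedDatasetSpark's implicit parameter.
implicit val sc: SparkContext = sdc2sc(mc)

With both values in implicit scope, the rest of the pipeline from the question (newAPIHadoopRDD, IndexedDatasetSpark, rowSimilarityIDS, dfsWrite) should be able to resolve the contexts it needs.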