Running the Mahout RowSimilarity recommender on MongoDB data

Date: 2016-05-05 07:08:56

Tags: mongodb scala apache-spark mahout mahout-recommender

I have managed to run Mahout RowSimilarity on flat files of the following format:


item-id tag1 tag-2 tag3

This had to be run via the CLI, and the output was again a flat file. I want to set this up so that it reads data from MongoDB (other databases would also work) and then dumps the output back into the DB, from where our system can pick it up.
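
For context, the flat-file run was along these lines (a sketch assuming the spark-rowsimilarity driver that ships with Mahout 0.10+; paths and master URL are placeholders):

mahout spark-rowsimilarity \
  --input /path/to/items.tsv \
  --output /path/to/rowsimilarity-output \
  --master local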

I have researched this over the past few days and found the following:

  • I have to write Scala code that implements RowSimilarity
  • Pass it an IndexedDataset object to process the data
  • Convert the output to the required format (JSON/CSV)

What I still have to figure out is how to get the data from the DB into an IndexedDataset. I have also read about the RDD format, but I still can't work out how to convert JSON data into an RDD that the RowSimilarity code can consume.

tl;dr: How do I convert MongoDB data so that it can be processed by Mahout/Spark RowSimilarity?

Edit 1: I have found some code that converts Mongo data to an RDD: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage#scala-example

Now I need help converting it to an IndexedDataset so that it can be passed to SimilarityAnalysis.rowSimilarityIDS.

tl;dr: How do I convert an RDD to an IndexedDataset?

1 Answer:

Answer 0 (score: 0)

Here is the answer:

import org.apache.hadoop.conf.Configuration
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.Schema
import org.apache.mahout.sparkbindings
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
import org.apache.spark.rdd.RDD
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat


object SparkExample extends App {
  // Mahout's Spark context wraps a SparkContext and is picked up implicitly
  // by the IndexedDataset operations below.
  implicit val mc = sparkbindings.mahoutSparkContext(masterUrl = "local", appName = "RowSimilarity")

  // Point mongo-hadoop at the source collection.
  val mongoConfig = new Configuration()
  mongoConfig.set("mongo.input.uri", "mongodb://hostname:27017/db.collection")

  // Read the collection as an RDD of (object id, BSON document) pairs.
  val documents: RDD[(Object, BSONObject)] = mc.newAPIHadoopRDD(
    mongoConfig,
    classOf[MongoInputFormat],
    classOf[Object],
    classOf[BSONObject]
  )

  // Extract (product_id, attribute values). product_attribute_value is an
  // array, whose toString renders as [ "a" , "b" ], so strip the brackets
  // and quotes, split on the delimiter, and normalise each value.
  val documentsArray: RDD[(String, Array[String])] = documents.map(
    doc => (
      doc._2.get("product_id").toString,
      doc._2.get("product_attribute_value").toString
        .replace("[ \"", "")
        .replace("\"]", "")
        .split("\" , \"")
        .map(value => value.toLowerCase.replace(" ", "-"))
    )
  )

  // Flatten the arrays into one (row, column) pair per attribute value,
  // which is the input shape IndexedDatasetSpark expects.
  val newDoc: RDD[(String, String)] = documentsArray.flatMapValues(x => x)
  val myIDs = IndexedDatasetSpark(newDoc)(mc)

  // Schema controlling how the similarity matrix is serialised to text.
  val readWriteSchema = new Schema(
    "rowKeyDelim" -> "\t",
    "columnIdStrengthDelim" -> ":",
    "omitScore" -> false,
    "elementDelim" -> " "
  )
  SimilarityAnalysis.rowSimilarityIDS(myIDs).dfsWrite("hdfs://hadoop:9000/mongo-hadoop-rowsimilarity", readWriteSchema)(mc)
}
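
With the readWriteSchema above, each line that dfsWrite emits should consist of a row key, a tab, and then space-separated columnId:strength pairs; an illustrative line (hypothetical item IDs and scores) would look like:

item-1	item-5:0.92 item-9:0.85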

build.sbt:

name := "scala-mongo"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.mongodb" %% "casbah" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "1.4.2"

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "2.6.0" exclude("javax.servlet", "servlet-api") exclude ("com.sun.jmx", "jmxri") exclude ("com.sun.jdmk", "jmxtools") exclude ("javax.jms", "jms") exclude ("org.slf4j", "slf4j-log4j12") exclude("hsqldb","hsqldb"),
  "org.scalatest" % "scalatest_2.10" % "1.9.2" % "test"
)
libraryDependencies += "org.apache.mahout" % "mahout-math-scala_2.10" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-spark_2.10" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-math" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-hdfs" % "0.11.2"

resolvers += "typesafe repo" at " http://repo.typesafe.com/typesafe/releases/"
resolvers += Resolver.mavenLocal

I used mongo-hadoop to fetch the data from Mongo and work with it. Since my data contained an array, I had to use flatMapValues to flatten it before passing it to IDS, which gives the correct output.
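
As a minimal sketch of that flattening step (hypothetical IDs and tags, with a plain SparkContext named sc assumed for brevity):

// (product_id, tags) pairs as read from Mongo, before flattening
val tagged: RDD[(String, Array[String])] = sc.parallelize(Seq(
  ("p1", Array("red", "cotton-shirt")),
  ("p2", Array("blue"))
))

// flatMapValues emits one (row, column) pair per array element:
// ("p1","red"), ("p1","cotton-shirt"), ("p2","blue")
val flat: RDD[(String, String)] = tagged.flatMapValues(x => x)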

PS: I've posted the answer here rather than on the linked question because this Q&A covers the full scope of getting the data and processing it.