What is the difference between using DataFrames and RDDs in Spark 1.5.2?

Asked: 2015-12-28 05:07:54

Tags: mongodb apache-spark apache-spark-sql

I read data from MongoDB and then map it to InteractionItem:

 val df = filterByParams(startTs, endTs, widgetIds, documents)
    .filter(item => {
      item._2.get("url") != "localhost" && !EXCLUDED_TRIGGERS.contains(item._2.get("trigger"))
    })
    .flatMap(item => {
      var res = Array[InteractionItem]()

      try {
        val widgetId = item._2.get("widgetId").toString
        val timestamp = java.lang.Long.parseLong(item._2.get("time").toString)
        val extra = item._2.get("extra").toString
        val extras = parseExtra(extra)
        val c = parseUserAgent(extras.userAgent.getOrElse(""))
        val os = c.os.family
        val osVersion = c.os.major
        val device = c.device.family
        val browser = c.userAgent.family
        val browserVersion = c.userAgent.major
        val adUnit = extras.adunit.get
        val gUid = extras.guid.get
        val trigger = item._2.get("trigger").toString
        val objectName = item._2.get("object").toString
        val response = item._2.get("response").toString
        val ts: Long = timestamp - timestamp % 3600


        // find the interaction configuration matching this event's trigger/object/response
        val interaction = interactionConfiguration.filter(interaction =>
          interaction.get("trigger") == trigger &&
            interaction.get("object") == objectName &&
            interaction.get("response") == response).head
        val clickThrough = interaction.get("clickThrough").asInstanceOf[Boolean]
        val interactionId = interaction.get("_id").toString

        adUnitPublishers.filter(x => x._2._2.toString == widgetId && x._1.toString == adUnit).foreach(publisher => {
          res = res :+ InteractionItem(widgetId, ts, adUnit, publisher._2._1.toString, os, osVersion, device, browser, browserVersion,
            interactionId, clickThrough, 1L, gUid)
        })
        bdPublishers.filter(x => x._1.toString == widgetId).foreach(publisher => {
          res = res :+ InteractionItem(widgetId, ts, adUnit, publisher._2.toString, os, osVersion, device, browser, browserVersion,
            interactionId, clickThrough, 1L, gUid)
        })
      }
      catch {
        case e: Exception => {
          log.info(e.getMessage)
          res = res :+ InteractionItem.invalid()
        }
      }
      res

    }).filter(i => i.interactionCount > 0)
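
For reference, a minimal sketch of what the InteractionItem case class might look like, inferred from the constructor calls and field accesses above (field names and types are assumptions, not the original definition):

    // Hypothetical reconstruction of InteractionItem; the real class may differ.
    case class InteractionItem(
      widgetId: String,
      date: Long,              // hour-truncated timestamp (ts above)
      section: String,         // adUnit above
      publisher: String,
      os: String,
      osVersion: String,
      device: String,
      browser: String,
      browserVersion: String,
      id: String,              // interactionId above
      clickThrough: Boolean,
      interactionCount: Long,
      gUid: String
    )

    object InteractionItem {
      // Marker for rows that failed to parse; interactionCount is 0 so the
      // trailing .filter(i => i.interactionCount > 0) drops them.
      def invalid(): InteractionItem =
        InteractionItem("", 0L, "", "", "", "", "", "", "", "", false, 0L, "")
    }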

With the RDD approach, I map again and reduceByKey:

    .map(i => ((i.widgetId, i.date, i.section, i.publisher, i.os, i.device, i.browser, i.clickThrough, i.id), i.interactionCount))
      .reduceByKey((a, b) => a + b)

With the DataFrame approach, I convert:

    .toDF()

    df.registerTempTable("interactions")
    df.cache()
    val v = sqlContext.sql("SELECT id, clickThrough, widgetId, date, section, publisher, os, device, browser, interactionCount" +
      " FROM interactions GROUP BY id, clickThrough, widgetId, date, section, publisher, os, device, browser, interactionCount")

From what I can see in the Spark UI, the DataFrame version needs 210 stages:

[Spark UI screenshot: DataFrame job stages]

For the RDD version, there are only 20 stages:

[Spark UI screenshot: RDD job stages]

What am I doing wrong here?

1 answer:

Answer 0 (score: 0):

You are not performing the same actions in the RDD and DF versions.
The reason the DF takes longer to process is the following extra tasks:

  1. registerTempTable()
  2. cache()
  3. While the RDD only reduces with the given expression, the DF processes the whole data as a table and also prepares the cache, consuming extra CPU and storage resources.
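
In addition, the SQL in the question groups by interactionCount itself and never sums it, so it is not the same aggregation as the map + reduceByKey in the RDD path. A sketch of a query that would mirror the RDD aggregation (column names assumed to match the InteractionItem fields) could look like this:

    // Sum interactionCount per key, mirroring the map + reduceByKey above.
    val aggregated = sqlContext.sql(
      "SELECT widgetId, date, section, publisher, os, device, browser, clickThrough, id, " +
      "SUM(interactionCount) AS interactionCount " +
      "FROM interactions " +
      "GROUP BY widgetId, date, section, publisher, os, device, browser, clickThrough, id")

    // Or with the DataFrame API available in Spark 1.5:
    // import org.apache.spark.sql.functions.sum
    // val aggregated = df.groupBy("widgetId", "date", "section", "publisher",
    //     "os", "device", "browser", "clickThrough", "id")
    //   .agg(sum("interactionCount").as("interactionCount"))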