I read data from MongoDB and map it to InteractionItem.
val df = filterByParams(startTs, endTs, widgetIds, documents)
  .filter(item => {
    item._2.get("url") != "localhost" && !EXCLUDED_TRIGGERS.contains(item._2.get("trigger"))
  })
  .flatMap(item => {
    var res = Array[InteractionItem]()
    try {
      val widgetId = item._2.get("widgetId").toString
      val timestamp = java.lang.Long.parseLong(item._2.get("time").toString)
      val extra = item._2.get("extra").toString
      val extras = parseExtra(extra)
      val c = parseUserAgent(extras.userAgent.getOrElse(""))
      val os = c.os.family
      val osVersion = c.os.major
      val device = c.device.family
      val browser = c.userAgent.family
      val browserVersion = c.userAgent.major
      val adUnit = extras.adunit.get
      val gUid = extras.guid.get
      val trigger = item._2.get("trigger").toString
      val objectName = item._2.get("object").toString
      val response = item._2.get("response").toString
      // round the timestamp down to a 3600-second (hourly) bucket
      val ts: Long = timestamp - timestamp % 3600

      // look up the matching interaction configuration entry
      val interaction = interactionConfiguration.filter(interaction =>
        interaction.get("trigger") == trigger &&
        interaction.get("object") == objectName &&
        interaction.get("response") == response).head
      val clickThrough = interaction.get("clickThrough").asInstanceOf[Boolean]
      val interactionId = interaction.get("_id").toString

      adUnitPublishers.filter(x => x._2._2.toString == widgetId && x._1.toString == adUnit).foreach(publisher => {
        res = res :+ InteractionItem(widgetId, ts, adUnit, publisher._2._1.toString, os, osVersion, device, browser, browserVersion,
          interactionId, clickThrough, 1L, gUid)
      })
      bdPublishers.filter(x => x._1.toString == widgetId).foreach(publisher => {
        res = res :+ InteractionItem(widgetId, ts, adUnit, publisher._2.toString, os, osVersion, device, browser, browserVersion,
          interactionId, clickThrough, 1L, gUid)
      })
    } catch {
      case e: Exception =>
        // on any parse/lookup failure, emit a sentinel item instead of failing the task
        log.info(e.getMessage)
        res = res :+ InteractionItem.invalid()
    }
    res
  }).filter(i => i.interactionCount > 0)
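The try/catch above emits a sentinel `InteractionItem.invalid()` on any parse failure, and the trailing `filter(i => i.interactionCount > 0)` silently drops those sentinels. A minimal plain-Scala stand-in for that pattern (no Spark; `Item` and the parse logic are simplified placeholders, not the real `InteractionItem`):

```scala
// Stand-in for the flatMap + sentinel-on-error pattern, using plain
// Scala collections instead of an RDD.
case class Item(widgetId: String, count: Long)

object Item {
  // Sentinel for unparseable records; count = 0 so a later
  // filter(_.count > 0) drops it without failing the job.
  def invalid(): Item = Item("", 0L)
}

def parse(raw: String): Seq[Item] =
  try {
    val Array(id, n) = raw.split(":")
    Seq(Item(id, n.toLong))
  } catch {
    case _: Exception => Seq(Item.invalid())
  }

val raw = Seq("w1:1", "garbage", "w2:1")
val items = raw.flatMap(parse).filter(_.count > 0)
// items == Seq(Item("w1", 1), Item("w2", 1)) -- "garbage" was dropped
```

The design trade-off: bad records never kill a task, but they are also invisible unless you count the sentinels before filtering them out.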
With the RDD approach, I then map again and reduceByKey:
.map(i => ((i.widgetId, i.date, i.section, i.publisher, i.os, i.device, i.browser, i.clickThrough, i.id), i.interactionCount))
.reduceByKey((a, b) => a + b)
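What this step computes is a per-key sum of `interactionCount` over the composite key. A plain-Scala stand-in (collections instead of an RDD; the composite key is trimmed to two hypothetical fields for brevity):

```scala
// Stand-in for map + reduceByKey: group by the composite key,
// then sum the per-record counts.
val pairs = Seq(
  (("w1", "chrome"), 1L),
  (("w1", "chrome"), 1L),
  (("w2", "firefox"), 1L)
)

val reduced: Map[(String, String), Long] =
  pairs.groupBy(_._1).map { case (key, vs) => key -> vs.map(_._2).sum }
// reduced == Map(("w1","chrome") -> 2L, ("w2","firefox") -> 1L)
```

On a real RDD, `reduceByKey` additionally combines values map-side before the shuffle, which is part of why the RDD job stays cheap.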
With the DataFrame approach, I instead convert:
.toDF()
df.registerTempTable("interactions")
df.cache()
val v = sqlContext.sql("SELECT id, clickThrough, widgetId, date, section, publisher, os, device, browser, interactionCount" +
" FROM interactions GROUP BY id, clickThrough, widgetId, date, section, publisher, os, device, browser, interactionCount")
From what I can see in the Spark UI, the DataFrame version needs 210 stages?
For the RDD it is only 20 stages:
What am I doing wrong here?
Answer 0 (score: 0)
The operations you run on the RDD and on the DataFrame are not the same. Note in particular that your SQL query lists interactionCount in the GROUP BY with no aggregate function, so it de-duplicates rows rather than summing counts per key the way reduceByKey does.
The reason the DataFrame takes longer to process is the extra work involved:
while the RDD only reduces by the given key expression, the DataFrame query processes the whole dataset as a table and also prepares the cache, consuming extra CPU and storage resources.
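The semantic gap is easy to see with plain Scala collections as a stand-in (rows simplified to a hypothetical `(key, interactionCount)` pair):

```scala
// Why the two jobs are not equivalent.
val rows = Seq(("w1", 1L), ("w1", 1L), ("w1", 2L))

// RDD version: reduceByKey over the key alone -> per-key sum.
val summed = rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
// summed == Map("w1" -> 4L)

// SQL version: GROUP BY every selected column, interactionCount included,
// with no aggregate function -> effectively DISTINCT over whole rows.
val distinctRows = rows.distinct
// distinctRows == Seq(("w1", 1L), ("w1", 2L))
```

To make the DataFrame query match the RDD, the usual fix is to drop interactionCount from the GROUP BY and aggregate it instead, e.g. `SUM(interactionCount)` over the remaining columns.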