I have to perform the following tasks on a dataset, using Apache Spark with Scala as the programming language:
deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626
6522013,2525,20150613
Group the data by device ID, so that we have a map of deviceid => (bytes, eventdate).
For each device, sort the set by eventdate. We now have an ordered set of byte counts, based on eventdate, for each device.
Pick the last 30 days of bytes from this ordered set.
Find the moving average of bytes for the last date, using a window of 30.
Find the standard deviation of bytes for the last date, using a window of 30.
Return two values in the result, (mean - k * stddev) and (mean + k * stddev), assuming k = 3.
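The last three steps reduce to simple arithmetic over the trailing window of byte counts. A minimal pure-Scala sketch of just the band computation (the object name, helper, and sample values are hypothetical; k = 3 as above):

```scala
object StatsBand {
  val k = 3.0

  // Returns (mean - k*stddev, mean + k*stddev) over a trailing window of values
  def band(window: Array[Double]): (Double, Double) = {
    val mean = window.sum / window.length
    // Population standard deviation over the window
    val variance = window.map(x => (x - mean) * (x - mean)).sum / window.length
    val stddev = math.sqrt(variance)
    (mean - k * stddev, mean + k * stddev)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical trailing byte counts for one device, already sorted by date
    val bytes = Array(246620.0, 1907.0, 1906.0, 2349.0, 2525.0)
    val (lo, hi) = band(bytes.takeRight(30)) // window of at most 30 values
    println(s"($lo, $hi)")
  }
}
```

The same arithmetic can then be applied per device once the grouping and sorting are in place.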
I am using Apache Spark 1.3.0. The actual dataset is wider, and this eventually has to run against a billion rows.
Here is the data structure for the dataset:
package com.testing

case class DailyDeviceAggregates(
  device_id: Integer,
  bytes: Long,
  eventdate: Integer
) extends Ordered[DailyDeviceAggregates] {
  def compare(that: DailyDeviceAggregates): Int = {
    eventdate - that.eventdate
  }
}

object DailyDeviceAggregates {
  def parseLogLine(logline: String): DailyDeviceAggregates = {
    val c = logline.split(",")
    DailyDeviceAggregates(c(0).toInt, c(1).toLong, c(2).toInt)
  }
}
The DeviceAnalyzer class looks like this:
package com.testing

import com.testing.DailyDeviceAggregates
import org.apache.spark.{SparkContext, SparkConf}
import scala.util.Sorting

object DeviceAnalyzer {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Device Statistics Analyzer")
    val sc = new SparkContext(sparkConf)
    val logFile = args(0)
    val deviceAggregateLogs = sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()
    val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)
    deviceIdsMap.foreach(a => {
      // I am stuck here !!
    })
    sc.stop()
  }
}
Beyond that, however, I am still stuck on the actual implementation of this algorithm.
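One possible way to continue from the groupBy, sketched (not tested) against the Spark 1.3 RDD API, continuing from the `deviceIdsMap` variable above; the window size of 30 and k = 3 are taken from the problem statement:

```scala
val k = 3.0

// For each device: sort by eventdate, take the trailing 30 days present,
// and compute the (mean - k*stddev, mean + k*stddev) band
val bands = deviceIdsMap.mapValues { aggregates =>
  val sorted = aggregates.toArray.sortBy(_.eventdate)
  val window = sorted.takeRight(30).map(_.bytes.toDouble)
  val mean = window.sum / window.length
  val stddev = math.sqrt(window.map(x => (x - mean) * (x - mean)).sum / window.length)
  (mean - k * stddev, mean + k * stddev)
}

// Collect the small per-device result back to the driver before printing;
// println inside foreach runs on the executors, not the driver
bands.collect().foreach { case (id, (lo, hi)) => println(s"$id,$lo,$hi") }
```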
Answer 0 (score: 0)
I have a very rough implementation working, but it is not up to par. Sorry, I am new to Scala/Spark, so my questions are quite basic. Here is what I have now:
import com.testing.DailyDeviceAggregates
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import scala.util.Sorting

object DeviceAnalyzer {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Device Analyzer")
    val sc = new SparkContext(sparkConf)
    val logFile = args(0)
    val deviceAggregateLogs = sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()

    // Calculate statistics based on bytes
    val deviceIdsMap = deviceAggregateLogs.groupBy(_.device_id)
    deviceIdsMap.foreach(a => {
      val device_id = a._1      // This is the device ID
      val allaggregates = a._2  // All device-aggregates for this device
      println(allaggregates)

      // Sorting.quickSort sorts the array in place and returns Unit,
      // so sort first and then use the array itself
      val sortedAggregates = allaggregates.toArray
      Sorting.quickSort(sortedAggregates) // Sort by eventdate via Ordered[DailyDeviceAggregates]
      println(sortedAggregates.mkString(", "))

      val byteValues = sortedAggregates.map(dda => dda.bytes.toDouble)
      val count = byteValues.length
      val sum = byteValues.sum
      val xbar = sum / count
      val sum_x_minus_x_bar_square = byteValues.map(x => (x - xbar) * (x - xbar)).sum
      val stddev = math.sqrt(sum_x_minus_x_bar_square / count)

      val vector: Vector = Vectors.dense(byteValues)
      println(vector)
      println(device_id + "," + xbar + "," + stddev)

      // Note: Statistics.colStats expects an RDD[Vector], not a single Vector,
      // so it cannot be called on `vector` from inside this foreach.
      //val summary: MultivariateStatisticalSummary = Statistics.colStats(...)
    })
    sc.stop()
  }
}
I would be very grateful if anyone could suggest improvements to the above.
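One pitfall worth isolating from the code above: `Sorting.quickSort` mutates its argument and returns `Unit`, which is why capturing its result and printing it shows nothing useful. A small self-contained demonstration (the `Rec` class and `sortedDates` helper are hypothetical stand-ins for `DailyDeviceAggregates`):

```scala
import scala.util.Sorting

object QuickSortDemo {
  // Hypothetical stand-in for DailyDeviceAggregates
  case class Rec(eventdate: Int) extends Ordered[Rec] {
    def compare(that: Rec): Int = eventdate - that.eventdate
  }

  // quickSort sorts in place and returns Unit, so work on a copy
  // and return the copy itself rather than the call's result
  def sortedDates(dates: Array[Int]): Array[Int] = {
    val copy = dates.clone()
    Sorting.quickSort(copy)
    copy
  }

  def main(args: Array[String]): Unit = {
    val recs = Array(Rec(20150630), Rec(20150621), Rec(20150626))
    Sorting.quickSort(recs) // mutates recs; the return value is ()
    println(recs.map(_.eventdate).mkString(",")) // 20150621,20150626,20150630
  }
}
```

Alternatively, `allaggregates.toArray.sortBy(_.eventdate)` avoids the in-place mutation entirely.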