The data in data.csv is:
07:36:00 PM 172.20.16.107 104.70.250.141 80 57188 0.48
07:33:00 PM 172.20.16.105 104.70.250.141 80 57188 0.66
07:34:00 PM 172.20.16.105 104.70.250.141 80 57188 0.47
07:35:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:44:00 PM 172.20.16.106 104.70.250.141 80 57188 0.49
07:45:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:46:00 PM 172.20.16.106 104.70.250.141 80 57188 0.33
07:47:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:48:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:36:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:37:00 PM 172.20.16.107 104.70.250.141 80 57188 0.48
07:37:00 PM 172.20.16.105 104.70.250.141 80 57188 0.66
07:38:00 PM 172.20.16.105 104.70.250.141 80 57188 0.47
07:39:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:50:00 PM 172.20.16.106 104.70.250.141 80 57188 0.49
07:51:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:52:00 PM 172.20.16.106 104.70.250.141 80 57188 0.33
07:53:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:54:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:40:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
Here is my code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object ScalaApp {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "Program")
    // we take the raw data in CSV format and convert it into a tuple of its six fields
    val data = sc.textFile("data.csv")
      .map(line => line.split(","))
      .map(GroupRecord => (GroupRecord(0), GroupRecord(1), GroupRecord(2),
        GroupRecord(3), GroupRecord(4), GroupRecord(5)))
    val numPurchases = data.count()
    val d1 = data.groupByKey(GroupRecord(2)) // here is the error
    println("No: " + numPurchases)
    println("Grouped Data" + d1)
  }
}
I just want the same data, grouped by source IP (column 2) and ordered by time (column 1). So the data I need is:
07:33:00 PM 172.20.16.105 104.70.250.141 80 57188 0.66
07:34:00 PM 172.20.16.105 104.70.250.141 80 57188 0.47
07:35:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:37:00 PM 172.20.16.105 104.70.250.141 80 57188 0.66
07:38:00 PM 172.20.16.105 104.70.250.141 80 57188 0.47
07:39:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:40:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:44:00 PM 172.20.16.106 104.70.250.141 80 57188 0.49
07:45:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:46:00 PM 172.20.16.106 104.70.250.141 80 57188 0.33
07:47:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:50:00 PM 172.20.16.106 104.70.250.141 80 57188 0.49
07:51:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:52:00 PM 172.20.16.106 104.70.250.141 80 57188 0.33
07:53:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:54:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:36:00 PM 172.20.16.107 104.70.250.141 80 57188 0.48
07:37:00 PM 172.20.16.107 104.70.250.141 80 57188 0.48
But there is something wrong with my code, so please help me!
Answer 0 (score: 3)
Your problem is that your second map creates a Tuple6 rather than a key-value pair, which is what you need if you want to perform any of the xxxByKey operations. If you want to group by the second column, you should use GroupRecord(1) as the key and the remaining fields as the value, then call groupByKey, like this:
data
  .map(GroupRecord => (GroupRecord(1), (GroupRecord(0), GroupRecord(2), GroupRecord(3), GroupRecord(4), GroupRecord(5))))
  .groupByKey()
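If you assign that result to a val, say grouped (a hypothetical name, not part of the snippet above), a quick way to inspect it on a small sample like this might be:

// 'grouped' is assumed to be the RDD produced by the map + groupByKey above
// collect() pulls everything to the driver, so only use it on small data
grouped.collect().foreach { case (key, records) =>
  println(key)
  records.foreach(println)
}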
Answer 1 (score: 3)
The solution above works, but once the data grows large I am not sure you want to deal with the reshuffling it causes. A better way is to create a DataFrame and use sqlContext to ORDER BY the IP address and the time.
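A minimal sketch of what that could look like on Spark 1.x with a SQLContext; the Flow case class, the column names, and the DataFrameApp object are illustrative assumptions, not part of this answer:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// the case class is declared outside main so Spark can derive a schema from it
case class Flow(time: String, srcIp: String, dstIp: String,
                dstPort: String, srcPort: String, value: String)

object DataFrameApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DataFrameApp").setMaster("local[4]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // split on whitespace and rejoin the time and its AM/PM marker into one field
    val df = sc.textFile("data.csv")
      .map(_.split("\\s+"))
      .map(r => Flow(r(0) + " " + r(1), r(2), r(3), r(4), r(5), r(6)))
      .toDF()

    df.registerTempTable("flows")
    // "grouping" the rows is just an ORDER BY here: every row is kept,
    // sorted by source IP and then by time (plain string order is fine for
    // this sample because all times fall in the same 12-hour period)
    sqlContext.sql("SELECT * FROM flows ORDER BY srcIp, time").show(50)
  }
}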
Answer 2 (score: 2)
As Glennie pointed out, you have not created a key-value pair for the groupByKey operation. Alternatively, you could use groupBy(_._3) to get the same result. In order to sort each group by the first column, you can apply flatMapValues after the grouping to sort the items within each group. The following code does exactly that:
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
  val sc = new SparkContext(sparkConf)

  val data = sc.textFile("data.csv")
    .map(line => line.split("\\s+"))
    .map(GroupRecord => (GroupRecord(2), (GroupRecord(0), GroupRecord(1), GroupRecord(2),
      GroupRecord(3), GroupRecord(4), GroupRecord(5))))

  // sort the records of each group by the first tuple field (the time)
  val result = data.groupByKey.flatMapValues(x => x.toList.sortBy(_._1))

  // assign the partition ID to each item to see that each group is sorted
  val resultWithPartitionID = result.mapPartitionsWithIndex((id, it) => it.map(x => (id, x)))

  // print the contents of the RDD; elements of different partitions might be interleaved
  resultWithPartitionID foreach println

  val collectedResult = resultWithPartitionID.collect.sortBy(_._1).map(_._2)

  // print the collected results
  println(collectedResult.mkString("\n"))
}
Answer 3 (score: 0)
Here we need to convert each line into a key-value pair so that the groupByKey mechanism can be applied. groupByKey gives us an Iterable of values per key; to sort the values of each key we convert them to a Seq and apply sortBy on the time field, then flatMapValues flattens the sorted values back into individual records, and a final map rebuilds each record with the time first.
Data.csv -> (the same input shown at the top of the question)
Code ->
val data = sc.textFile("src/Data.csv")
  .map(line => {
    // this answer assumes the fields are tab-separated
    val GroupRecord = line.split("\t")
    (GroupRecord(1), (GroupRecord(0), GroupRecord(2), GroupRecord(3), GroupRecord(4), GroupRecord(5)))
  })

val numPurchases = data.count()

// group by source IP, sort each group's records by time,
// flatten the groups and rebuild each record with the time first
val d1 = data.groupByKey()
  .map(f => (f._1, f._2.toSeq.sortBy(f => f._1)))
  .flatMapValues(f => f)
  .map(f => (f._2._1, f._1, f._2._2, f._2._3, f._2._4, f._2._5))

d1 foreach (println(_))
println("No: " + numPurchases)