The data in data.csv is:
07:36:00 PM 172.20.16.107 104.70.250.141 80 57188 0.48
07:33:00 PM 172.20.16.105 104.70.250.141 80 57188 0.66
07:34:00 PM 172.20.16.105 104.70.250.141 80 57188 0.47
07:35:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:44:00 PM 172.20.16.106 104.70.250.141 80 57188 0.49
07:45:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:46:00 PM 172.20.16.106 104.70.250.141 80 57188 0.33
07:47:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:48:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:36:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:37:00 PM 172.20.16.107 104.70.250.141 80 57188 0.48
07:37:00 PM 172.20.16.105 104.70.250.141 80 57188 0.66
07:38:00 PM 172.20.16.105 104.70.250.141 80 57188 0.47
07:39:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:50:00 PM 172.20.16.106 104.70.250.141 80 57188 0.49
07:51:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:52:00 PM 172.20.16.106 104.70.250.141 80 57188 0.33
07:53:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:54:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:40:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
Here is my code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object ScalaApp {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "Program")
    // we take the raw data in CSV format and convert it into a tuple of its six fields
    val data = sc.textFile("data.csv")
      .map(line => line.split(","))
      .map(GroupRecord => (GroupRecord(0), GroupRecord(1), GroupRecord(2),
        GroupRecord(3), GroupRecord(4), GroupRecord(5)))
    val numPurchases = data.count()
    val d1 = data.groupByKey(GroupRecord(2)) // here is the error
    println("No: " + numPurchases)
    println("Grouped Data" + d1)
  }
}
I just want the same data, grouped by source IP (column 2) and ordered by time (column 1). So the data I need is:
07:33:00 PM 172.20.16.105 104.70.250.141 80 57188 0.66
07:34:00 PM 172.20.16.105 104.70.250.141 80 57188 0.47
07:35:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:37:00 PM 172.20.16.105 104.70.250.141 80 57188 0.66
07:38:00 PM 172.20.16.105 104.70.250.141 80 57188 0.47
07:39:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:40:00 PM 172.20.16.105 104.70.250.141 80 57188 0.48
07:44:00 PM 172.20.16.106 104.70.250.141 80 57188 0.49
07:45:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:46:00 PM 172.20.16.106 104.70.250.141 80 57188 0.33
07:47:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:50:00 PM 172.20.16.106 104.70.250.141 80 57188 0.49
07:51:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:52:00 PM 172.20.16.106 104.70.250.141 80 57188 0.33
07:53:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:54:00 PM 172.20.16.106 104.70.250.141 80 57188 0.48
07:36:00 PM 172.20.16.107 104.70.250.141 80 57188 0.48
07:37:00 PM 172.20.16.107 104.70.250.141 80 57188 0.48
But there is something wrong with my code, so please help me!
Answer 0 (score: 3)
Your problem is that your second map creates a Tuple6 rather than a key-value pair, which is what you need if you want to perform any of the xxxByKey operations. If you want to group by the second column, you should use GroupRecord(1) as the key and the remaining fields as the value, then call groupByKey, like this:
data
  .map(GroupRecord => (GroupRecord(1), (GroupRecord(0), GroupRecord(2), GroupRecord(3), GroupRecord(4), GroupRecord(5))))
  .groupByKey()
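If you assign that result to a val, say grouped (a hypothetical name, not part of the snippet above), a quick way to inspect it on a small sample like this might be:

// 'grouped' is assumed to be the RDD produced by the map + groupByKey above
// collect() pulls everything to the driver, so only use it on small data
grouped.collect().foreach { case (key, records) =>
  println(key)
  records.foreach(println)
}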
Answer 1 (score: 3)
The solution above works, but once the data grows large I am not sure you want to deal with the reshuffling it causes. A better way is to create a DataFrame and use sqlContext to ORDER BY the IP address and the time.
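A minimal sketch of what that could look like on Spark 1.x with a SQLContext; the Flow case class, the column names, and the DataFrameApp object are illustrative assumptions, not part of this answer:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// the case class is declared outside main so Spark can derive a schema from it
case class Flow(time: String, srcIp: String, dstIp: String,
                dstPort: String, srcPort: String, value: String)

object DataFrameApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DataFrameApp").setMaster("local[4]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // split on whitespace and rejoin the time and its AM/PM marker into one field
    val df = sc.textFile("data.csv")
      .map(_.split("\\s+"))
      .map(r => Flow(r(0) + " " + r(1), r(2), r(3), r(4), r(5), r(6)))
      .toDF()

    df.registerTempTable("flows")
    // "grouping" the rows is just an ORDER BY here: every row is kept,
    // sorted by source IP and then by time (plain string order is fine for
    // this sample because all times fall in the same 12-hour period)
    sqlContext.sql("SELECT * FROM flows ORDER BY srcIp, time").show(50)
  }
}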
Answer 2 (score: 2)
As Glennie pointed out, you have not created a key-value pair for the groupByKey operation. Alternatively, you could use groupBy(_._3) to get the same result. In order to sort each group by the first column, you can apply flatMapValues after the grouping to sort the items within each group. The following code does exactly that:
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
  val sc = new SparkContext(sparkConf)

  val data = sc.textFile("data.csv")
    .map(line => line.split("\\s+"))
    .map(GroupRecord => (GroupRecord(2), (GroupRecord(0), GroupRecord(1), GroupRecord(2),
      GroupRecord(3), GroupRecord(4), GroupRecord(5))))

  // sort the records of each group by the first tuple field (the time)
  val result = data.groupByKey.flatMapValues(x => x.toList.sortBy(_._1))

  // assign the partition ID to each item to see that each group is sorted
  val resultWithPartitionID = result.mapPartitionsWithIndex((id, it) => it.map(x => (id, x)))

  // print the contents of the RDD; elements of different partitions might be interleaved
  resultWithPartitionID foreach println

  val collectedResult = resultWithPartitionID.collect.sortBy(_._1).map(_._2)

  // print the collected results
  println(collectedResult.mkString("\n"))
}
Answer 3 (score: 0)
Here we need to convert each line into a key-value pair so that the groupByKey mechanism can be applied. groupByKey gives us an Iterable of values per key; to sort the values of each key we convert them to a Seq and apply sortBy on the time field, then flatMapValues flattens the sorted values back into individual records, and a final map rebuilds each record with the time first.
Data.csv -> (the same input shown at the top of the question)
Code ->
val data = sc.textFile("src/Data.csv")
  .map(line => {
    // this answer assumes the fields are tab-separated
    val GroupRecord = line.split("\t")
    (GroupRecord(1), (GroupRecord(0), GroupRecord(2), GroupRecord(3), GroupRecord(4), GroupRecord(5)))
  })

val numPurchases = data.count()

// group by source IP, sort each group's records by time,
// flatten the groups and rebuild each record with the time first
val d1 = data.groupByKey()
  .map(f => (f._1, f._2.toSeq.sortBy(f => f._1)))
  .flatMapValues(f => f)
  .map(f => (f._2._1, f._1, f._2._2, f._2._3, f._2._4, f._2._5))

d1 foreach (println(_))
println("No: " + numPurchases)