我是Scala和Spark的新手,我不知道该怎么做。
我已经预处理了一个CSV文件,得到了包含以下格式列表的RDD:
List("2014-01-01T23:56:06.0", NaN, 1, NaN)
List("2014-01-01T23:56:06.0", NaN, NaN, 2)
所有列表具有相同数量的元素。
我想要做的是合并具有相同第一个元素(时间戳)的列表。例如,我希望这两个示例列表仅生成一个具有以下值的列表:
List("2014-01-01T23:56:06.0", NaN, 1, 2)
感谢您的帮助:)
答案 0 :(得分:1)
# Below can help you in achieving your target
val input_rdd1 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "NaN", "1", "NaN")))
val input_rdd2 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "NaN", "NaN", "2")))
//added one more row for your data
val input_rdd3 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "2", "NaN", "NaN")))
val input_df1 = input_rdd1.toDF("col1", "col2", "col3", "col4")
val input_df2 = input_rdd2.toDF("col1", "col2", "col3", "col4")
val input_df3 = input_rdd3.toDF("col1", "col2", "col3", "col4")
val output_df = input_df1.union(input_df2).union(input_df3).groupBy($"col1").agg(min($"col2").as("col2"), min($"col3").as("col3"), min($"col4").as("col4"))
output_df.show
output:
+--------------------+----+----+----+
| col1|col2|col3|col4|
+--------------------+----+----+----+
|2014-01-01T23:56:...| 2| 1| 2|
+--------------------+----+----+----+
答案 1 :(得分:0)
如果数组尾部值是双精度值,则可以通过这种方式来实现(如sachav建议):
val original = sparkContext.parallelize(
Seq(
List("2014-01-01T23:56:06.0", NaN, 1.0, NaN),
List("2014-01-01T23:56:06.0", NaN, NaN, 2.0)
)
)
val result = original
.map(v => v.head -> v.tail)
.reduceByKey(
(acc, curr) => acc.zip(curr).map({ case (left, right) => if (left.asInstanceOf[Double].isNaN) right else left }))
.map(v => v._1 :: v._2)
result.foreach(println)
输出为:
List(2014-01-01T23:56:06.0, NaN, 1.0, 2.0)