I have data of type org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])].
How can I print the data, or access it?
My code is as follows:
val sessionsDF = Seq(("day1","user1","session1", 100.0),
("day1","user1","session2",200.0),
("day2","user1","session3",300.0),
("day2","user1","session4",400.0),
("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal").toDF()
val groupByData=sessionsDF.groupBy(x=>(x.get(0),x.get(1)))
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)
The code above returns org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])].
Answer 0 (score: 1)
In the first step you have an extra .toDF(). The correct one is as follows:
val sessionsDF = Seq(("day1","user1","session1", 100.0),
("day1","user1","session2",200.0),
("day2","user1","session3",300.0),
("day2","user1","session4",400.0),
("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal")
In the second step you missed .rdd, so the actual second step is
val groupByData=sessionsDF.rdd.groupBy(x=>(x.get(0),x.get(1)))
which has the dataType you mentioned in the question:
scala> groupByData
res12: org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])] = ShuffledRDD[9] at groupBy at <console>:25
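As an aside (not part of the original answer): the key type is (Any, Any) because Row.get(i) returns Any. If you want typed keys, a sketch using the typed getters on Row, assuming day and userId are both strings:
// getString returns String, so the key type becomes (String, String)
val typedGroups = sessionsDF.rdd.groupBy(x => (x.getString(0), x.getString(1)))
// typedGroups: org.apache.spark.rdd.RDD[((String, String), Iterable[org.apache.spark.sql.Row])]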
To see the data in the groupByData rdd, you can simply use foreach:
groupByData.foreach(println)
which will give you
((day1,user1),CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0]))
((day2,user1),CompactBuffer([day2,user1,session3,300.0], [day2,user1,session4,400.0], [day2,user1,session4,99.0]))
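One caveat (an addition, not from the original answer): foreach runs on the executors, so in cluster mode the println output lands in the executor logs rather than your console. Collecting first keeps the printing on the driver, which is safe here because the data is tiny:
// collect() pulls the (key, group) pairs to the driver, then print locally
groupByData.collect().foreach(println)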
Now, your third step is filtering the data whose day column value is day1 in the dataframe, and you are taking only the values of the grouped rdd data.
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)
The dataType returned by this step is
scala> filterData
res13: org.apache.spark.rdd.RDD[Iterable[org.apache.spark.sql.Row]] = MapPartitionsRDD[11] at map at <console>:27
You can view the data with foreach as above:
filterData.foreach(println)
which will give you
CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0])
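If you'd rather work with the rows as a local collection (an added sketch, not from the original answer), you can flatten the Iterable[Row] groups into a single RDD[Row] and collect it to the driver:
// flatMap(identity) turns RDD[Iterable[Row]] into RDD[Row]
val rows = filterData.flatMap(identity).collect()   // Array[org.apache.spark.sql.Row]
rows.foreach(println)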
You can see that the returned dataType is RDD[Iterable[org.apache.spark.sql.Row]], so you can print each value using map:
filterData.map(x => x.map(y => println(y(0), y(1), y(2), y(3)))).collect
which will give you
(day1,user1,session1,100.0)
(day1,user1,session2,200.0)
And if you do only
filterData.map(x => x.map(y => println(y(0), y(3)))).collect
you will get
(day1,100.0)
(day1,200.0)
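Note that this map-with-println pattern prints on the executors and builds an RDD[Iterable[Unit]] that is collected only for its side effects. An alternative sketch (my addition, not from the original answer) returns the fields instead and prints them on the driver:
// return the fields as tuples, collect them, then print locally
filterData
  .flatMap(iter => iter.map(y => (y.get(0), y.get(3))))
  .collect()
  .foreach(println)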
I hope the answer is helpful.