I have:
val DF1 = sparkSession.sql("select col1,col2,col3 from table");
val tupleList = DF1.select("col1","col2").rdd.map(r => (r(0),r(1))).collect()
tupleList.foreach(x=> x.productIterator.foreach(println))
But I am not getting all of the tuples in the output. Where is the problem? The table data is:
col1 col2
AA CCC
AA BBB
DD CCC
AB BBB
Others BBB
GG ALL
EE ALL
Others ALL
ALL BBB
NU FFF
NU Others
Others Others
C FFF
The output I get is:
CCC AA BBB AA Others AA Others DD ALL Others ALL GG ALL ALL
Answer 0 (score: 9)
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val df1 = hiveContext.sql("select id, name from class_db.students")
scala> df1.show()
+----+-------+
| id| name|
+----+-------+
|1001| John|
|1002|Michael|
+----+-------+
scala> df1.select("id", "name").rdd.map(x => (x.get(0), x.get(1))).collect()
res3: Array[(Any, Any)] = Array((1001,John), (1002,Michael))
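As an aside, x.get(0) returns Any, so the collected result above is typed Array[(Any, Any)]. If the column types are known, a minimal sketch like the following keeps the tuples strongly typed (this assumes id is an Int and name a String, matching the example data; getAs[T] throws a ClassCastException if the actual schema differs):

// Strongly typed variant of the collect above (types assumed, not from the source).
val typedTuples: Array[(Int, String)] = df1
  .select("id", "name")
  .rdd
  .map(row => (row.getAs[Int]("id"), row.getAs[String]("name")))
  .collect()

// Print one tuple per line, e.g. "1001 -> John"
typedTuples.foreach { case (id, name) => println(s"$id -> $name") }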
Answer 1 (score: 0)
To fix the invalid syntax in the PySpark version (the original line used full-width 。 characters in place of . and stray spaces before the brackets, which Python cannot parse):
temp = df1.select('id', 'name').rdd.map(lambda x: (x[0], x[1])).collect()