Are there any examples of how to convert an RDD to a DataFrame, and the DataFrame back to an RDD, in PySpark 1.6.1?
toDF()
cannot be used in 1.6.1?
For example, I have an RDD like this:
data = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])
Answer 0 (score: 0)
If for some reason you cannot use the .toDF() method, the solution I would suggest is:
data = sqlContext.createDataFrame(sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)]))
This creates a DataFrame with columns named "_n", where n is the column number. If you want to rename the columns, I suggest looking at this post: How to change dataframe column names in pyspark?. In your case, all you need to do is:
data_named = data.selectExpr("_1 as One", "_2 as Two", "_3 as Three", "_4 as Four", "_5 as Five")
Now let's look at the DataFrame:
data_named.show()
This outputs:
+---+---+-----+----+----+
|One|Two|Three|Four|Five|
+---+---+-----+----+----+
|  a|  b|    c|   1|   4|
|  o|  u|    w|   9|   3|
|  s|  q|    a|   8|   6|
|  l|  g|    z|   8|   3|
|  a|  b|    c|   9|   8|
|  s|  q|    a|  10|  10|
|  l|  g|    z|  20|  20|
|  o|  u|    w|  77|  77|
+---+---+-----+----+----+
Edit: Also try again, because you should be able to use .toDF() in Spark 1.6.1.
Answer 1 (score: 0)
I don't see any reason why rdd.toDF
cannot be used in PySpark with Spark 1.6.1. Please check the Spark 1.6.1 Python docs for toDF()
: https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext
As per your requirement,
rdd = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])
# RDD to DataFrame
df = rdd.toDF()
# Column names can be provided, e.g. df2 = df.toDF('col1', 'col2', 'col3', 'col4', 'col5')
# DataFrame back to RDD
rdd2 = df.rdd