How to convert an RDD to a DataFrame in PySpark 1.6.1?

Asked: 2017-10-10 03:52:20

Tags: pyspark rdd

Are there any examples of how to convert an RDD to a DataFrame, and a DataFrame back to an RDD, in PySpark 1.6.1? Does toDF() not work in 1.6.1?

For example, I have an RDD like this:

data = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
                       ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])

2 Answers:

Answer 0 (score: 0)

If for some reason you can't use the .toDF() method, the solution I'd suggest is:

data = sqlContext.createDataFrame(sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
                   ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)]))

This creates a DataFrame whose columns are named "_n", where n is the column's position ("_1" through "_5" here). If you want to rename the columns, I suggest looking at this post: How to change dataframe column names in pyspark?. But all you really need to do is:

data_named = data.selectExpr("_1 as One", "_2 as Two", "_3 as Three", "_4 as Four", "_5 as Five")

Now let's look at the DataFrame:

data_named.show()

This will output:

+---+---+-----+----+----+
|One|Two|Three|Four|Five|
+---+---+-----+----+----+
|  a|  b|    c|   1|   4|
|  o|  u|    w|   9|   3|
|  s|  q|    a|   8|   6|
|  l|  g|    z|   8|   3|
|  a|  b|    c|   9|   8|
|  s|  q|    a|  10|  10|
|  l|  g|    z|  20|  20|
|  o|  u|    w|  77|  77|
+---+---+-----+----+----+
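As a side note, instead of renaming after the fact, createDataFrame also accepts a list of column names as its schema argument, so the columns can be named when the DataFrame is built. A minimal sketch (the variable and column names are illustrative):

# Alternative: pass column names as the schema, so no rename step is needed
rdd = sc.parallelize([('a', 'b', 'c', 1, 4), ('o', 'u', 'w', 9, 3)])
data_named2 = sqlContext.createDataFrame(rdd, ['One', 'Two', 'Three', 'Four', 'Five'])
data_named2.show()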

EDIT: Try again, because you should be able to use .toDF() in Spark 1.6.1.

Answer 1 (score: 0)

I don't see any reason why rdd.toDF can't be used in PySpark with Spark 1.6.1. Please check the Spark 1.6.1 Python docs for toDF(): https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext
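One thing to check first: in PySpark, toDF() is attached to RDDs only once a SQLContext has been instantiated, so the method will be missing until one exists. A minimal setup sketch (the app name is illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='rdd-to-df-example')  # illustrative app name
sqlContext = SQLContext(sc)  # instantiating SQLContext makes rdd.toDF() available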

For your requirement:

rdd = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])

# RDD to DataFrame
df = rdd.toDF()
# Column names can also be supplied, one per column, e.g.:
# df2 = df.toDF('col1', 'col2', 'col3', 'col4', 'col5')

# DataFrame to RDD
rdd2 = df.rdd
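Note that df.rdd returns an RDD of Row objects rather than the original tuples. A quick illustrative check, mapping each Row back to a plain tuple if the original shape is needed:

print(rdd2.take(2))  # Row objects, e.g. Row(_1=u'a', _2=u'b', _3=u'c', _4=1, _5=4)

tuples_rdd = rdd2.map(tuple)  # convert each Row back to a plain tuple
print(tuples_rdd.take(2))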