PySpark: converting a dataframe to an RDD and splitting it

Asked: 2018-11-07 08:29:26

Tags: list dataframe split rdd

I have a dataframe and I converted it to an RDD, but when I apply the split function I get an error message.

Here is my dataframe:

I did convert it to a list and an RDD.

df = spark.createDataFrame([(1,  '2013-07-25 00:00:00.0', 100, 'CLOSED'),
                            (2,  '2013-07-25 12:23:00.0', 200, 'PENDING PAYMENT'),
                            (3,  '2013-07-25 03:30:00.0', 400, 'PENDING PAYMENT'),
                            (4,  '2013-07-25 12:23:00.0', 50,  'COMPLETE'),
                            (5,  '2013-07-25 12:23:00.0', 50,  'CLOSED'),
                            (6,  '2013-07-26 02:00:00.0', 300, 'CLOSED'),
                            (7,  '2013-07-26 6:23:00.0',  10,  'PENDING PAYMENT'),
                            (8,  '2013-07-26 03:30:00.0', 5,   'PENDING PAYMENT'),
                            (9,  '2013-07-26 2:23:00.0',  20,  'COMPLETE'),
                            (10, '2013-07-26 1:23:00.0',  30,  'CLOSED')],
                           ['Id', 'Date', 'Total', 'Transaction'])

Then I apply:

rdd = df.rdd.map(list).collect()   # collect() returns a plain Python list of lists
rdd_df = sc.parallelize(rdd)       # redistribute it as an RDD
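As an aside, a minimal sketch (plain Python, no Spark session assumed) of the shapes this round trip produces: `collect()` pulls every row to the driver as a list of lists, and `parallelize()` just redistributes those same list elements, so each element of `rdd_df` is a Python list, never a comma-joined string; `df.rdd.map(list)` on its own would already yield an equivalent RDD without the driver round trip.

```python
# Simulated result of df.rdd.map(list).collect(): a plain Python
# list of lists, one inner list per dataframe row (no Spark needed).
collected = [
    [1, '2013-07-25 00:00:00.0', 100, 'CLOSED'],
    [2, '2013-07-25 12:23:00.0', 200, 'PENDING PAYMENT'],
]

# sc.parallelize(collected) would redistribute these same elements,
# so every element of the resulting RDD is still a Python list.
element_types = {type(row) for row in collected}  # {list}
```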

and then:

rdd_df.map(lambda z: z.split(","))

which fails with:

"AttributeError: 'list' object has no attribute 'split'"

But rdd_df is not a list; let's check:

type(rdd_df)
pyspark.rdd.RDD

What could be the problem? I want to map the rows and add the 3rd column. The desired output would be something like;
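The failure itself can be reproduced without Spark: each element passed to the lambda is already a Python list of column values, not a comma-separated string, so `split` (a `str` method) does not exist on it. A minimal sketch of one row, where the appended value is purely illustrative:

```python
# One element of df.rdd.map(list): already a Python list, not a string.
row = [1, '2013-07-25 00:00:00.0', 100, 'CLOSED']

# Calling split on it fails, because split is a str method:
try:
    row.split(",")
except AttributeError as exc:
    error_message = str(exc)  # mentions 'list' object has no attribute 'split'

# The fields are already separated; index the third column directly:
total = row[2]

# "Map and add a column" by appending a derived value to the row
# (doubling here is just a placeholder for the real computation):
new_row = row + [total * 2]
```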

Thanks.

0 Answers:

There are no answers.