I have a DataFrame which I converted to an RDD, but when I apply the split function I get an error.
Here is my DataFrame:
df = spark.createDataFrame([(1, '2013-07-25 00:00:00.0', 100, 'CLOSED'),
                            (2, '2013-07-25 12:23:00.0', 200, 'PENDING PAYMENT'),
                            (3, '2013-07-25 03:30:00.0', 400, 'PENDING PAYMENT'),
                            (4, '2013-07-25 12:23:00.0', 50, 'COMPLETE'),
                            (5, '2013-07-25 12:23:00.0', 50, 'CLOSED'),
                            (6, '2013-07-26 02:00:00.0', 300, 'CLOSED'),
                            (7, '2013-07-26 6:23:00.0', 10, 'PENDING PAYMENT'),
                            (8, '2013-07-26 03:30:00.0', 5, 'PENDING PAYMENT'),
                            (9, '2013-07-26 2:23:00.0', 20, 'COMPLETE'),
                            (10, '2013-07-26 1:23:00.0', 30, 'CLOSED')],
                           ['Id', 'Date', 'Total', 'Transaction'])
Then I converted it to a list and an RDD, and applied split:

rdd = df.rdd.map(list).collect()
rdd_df = sc.parallelize(rdd)
rdd_df.map(lambda z: z.split(","))

which gives:

"AttributeError: 'list' object has no attribute 'split'"

But rdd_df is not a list; let's check:

type(rdd_df)
pyspark.rdd.RDD

What could be the problem? I want to map the rows and add a third column. The desired output would look something like this:
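For reference, this is what seems to be happening: after df.rdd.map(list), each element of the RDD is a Python list of field values, not a comma-separated string, so the lambda passed to map receives a list, and list has no split method. A minimal plain-Python sketch of the same mistake (no Spark needed; the sample row is taken from the DataFrame above):

```python
# One element of rdd_df, as produced by df.rdd.map(list):
row = [1, '2013-07-25 00:00:00.0', 100, 'CLOSED']

# split() is a str method; calling it on a list raises AttributeError,
# which is the error Spark reports from inside the map() lambda.
try:
    row.split(",")
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'split'
```

Since each element is already a list, individual fields can be reached by indexing (e.g. z[2] for the Total column) without any split.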
Thanks.