I want to split a DataFrame that has a single column into individual columns, essentially to get a tabular format. The raw data is a JSON file, which I have already filtered down to the final set.
DataFrame name: result
DataFrame schema, from result.printSchema():
root
|-- results: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- 50th Percentile: string (nullable = true)
| | |-- 90th Percentile: string (nullable = true)
| | |-- 95th Percentile: string (nullable = true)
| | |-- 99th Percentile: string (nullable = true)
| | |-- Avg: string (nullable = true)
| | |-- Count: string (nullable = true)
Output of result.show():
+--------------------+
| results|
+--------------------+
|[[0.390000,1.600...|
+--------------------+
result.collect()
[Row(results=[Row(50th Percentile=u'0.390000', 90th Percentile=u'1.600000', 95th Percentile=u'1.000000', 99th Percentile=u'2.000000', Avg=u'10.108981', Count=u'12')])]
What I tried:
result_d=result.withColumn('new_col', split(result.results, ',')[0])
This raises an exception:
pyspark.sql.utils.AnalysisException: u'cannot resolve \'split(json.`results`, ",")\' due to data type mismatch: argument 1 requires string type, however, \'json.`results`\' is of array<struct<50th Percentile:string,90th Percentile:string,95th Percentile:string,99th Percentile:string,Avg:string,Count:string>> type.;'
I also tried converting to a pandas DataFrame and then splitting:
p=result.toPandas()
then tried to split the nested string into columns:
p['col'].str.split(",")
but got NaN output:
0 NaN
I am new to Spark. Please point me to the right way to do these kinds of transformations.