Splitting an array column into columns in Spark

Date: 2016-09-04 14:13:59

Tags: python apache-spark pyspark

I want to split a DataFrame that has a single column into individual columns, mainly to get a tabular format. The raw data is a JSON file, which I have already filtered down to this final set.

DataFrame name: result

DataFrame schema: result.printSchema()

root
|-- results: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- 50th Percentile: string (nullable = true)
|    |    |-- 90th Percentile: string (nullable = true)
|    |    |-- 95th Percentile: string (nullable = true)
|    |    |-- 99th Percentile: string (nullable = true)
|    |    |-- Avg: string (nullable = true)
|    |    |-- Count: string (nullable = true)

Output of result.show():

+--------------------+
|             results|
+--------------------+
|[[0.390000,1.600...|
+--------------------+

result.collect()
[Row(results=[Row(50th Percentile=u'0.390000', 90th Percentile=u'1.600000', 95th Percentile=u'1.000000', 99th Percentile=u'2.000000', Avg=u'10.108981', Count=u'12')])]

I tried:

result_d=result.withColumn('new_col', split(result.results, ',')[0])

which raised this exception:

pyspark.sql.utils.AnalysisException: u'cannot resolve \'split(json.`results`, ",")\' due to data type mismatch: argument 1 requires string type, however, \'json.`results`\' is of array<struct<50th Percentile:string,90th Percentile:string,95th Percentile:string,99th Percentile:string,Avg:string,Count:string>> type.;'

I also tried converting to a pandas DataFrame and splitting there:

p=result.toPandas()

and then tried to split the nested string into columns:

p['col'].str.split(",")

but got NaN as output:

0   NaN
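On the pandas route the NaN is expected: after `toPandas()`, each cell of `results` holds a list of Row objects rather than a string, and `.str` accessors return NaN for any non-string value. A sketch of that behaviour and one pandas-side fix, using a plain dict to stand in for the collected Row (a simplifying assumption):

```python
import pandas as pd

# A dict standing in for the Row collected from Spark
# (assumption: same field names and values as in the question)
rec = {"50th Percentile": "0.390000", "90th Percentile": "1.600000",
       "95th Percentile": "1.000000", "99th Percentile": "2.000000",
       "Avg": "10.108981", "Count": "12"}
p = pd.DataFrame({"results": [[rec]]})

# .str methods only apply to strings; each cell here is a list,
# so .str.split() silently yields NaN, as in the question
print(p["results"].str.split(","))

# Flatten instead: explode the list, then expand each record into columns
flat = pd.DataFrame(list(p["results"].explode()))
print(flat)
```

That said, doing the flattening in Spark itself avoids pulling the data to the driver at all.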

I am new to Spark. Please point me to the right approach for these kinds of transformations.

0 Answers:

There are no answers yet.