I have a list of maps that contains the following:
fields = [{"trials": 1.0, "name": "Alice", "score": 8.0}, {"trials": 2.0, "name": "Bob", "score": 10.0}]
The list of maps comes back from an API call as a JSON blob. When I convert it to a DataFrame in PySpark, I get this:
+-------------------+----+
|fields             |key |
+-------------------+----+
|[1.0, Alice, 8.0]  |key1|
|[2.0, Bob, 10.0]   |key2|
|[1.0, Charlie, 8.0]|key3|
|[2.0, Sue, 10.0]   |key4|
|[1.0, Clark, 8.0]  |key5|
|[3.0, Sarah, 10.0] |key6|
+-------------------+----+
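For reference, the raw blob I feed into Spark looks roughly like this (simplified to two records; this is what I pass in as results in the code further down):

# simplified example of the JSON records the API returns, one per key
results = [
    '{"key": "key1", "fields": {"trials": 1.0, "name": "Alice", "score": 8.0}}',
    '{"key": "key2", "fields": {"trials": 2.0, "name": "Bob", "score": 10.0}}',
]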
I want to turn it into this form:
+------+-------+-----+----+
|trials|name   |score|key |
+------+-------+-----+----+
|1.0   |Alice  |8.0  |key1|
|2.0   |Bob    |10.0 |key2|
|1.0   |Charlie|8.0  |key3|
|2.0   |Sue    |10.0 |key4|
|1.0   |Clark  |8.0  |key5|
|3.0   |Sarah  |10.0 |key6|
+------+-------+-----+----+
What is the best way to go about this? Here is what I have so far:
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# `results` is the list of raw JSON strings returned by the API call
rdd = sc.parallelize(results)
df = sqlContext.read.json(rdd)
df.show()
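From what I can tell, fields ends up as a struct column, so I suspect the flattening step is something like the select below, but I am not sure whether this is the right or most efficient approach:

from pyspark.sql.functions import col

# expand the struct's children into top-level columns and keep key alongside
flat_df = df.select(col("fields.trials").alias("trials"),
                    col("fields.name").alias("name"),
                    col("fields.score").alias("score"),
                    col("key"))
# or, more compactly, select every nested field at once
flat_df = df.select("fields.*", "key")
flat_df.show()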