Aggregating JSON data with pyspark

Date: 2019-05-17 09:36:17

Tags: json pyspark

I am currently working with pyspark and the great Language Game dataset, which contains many samples as JSON objects like the one below.

Each sample represents one round of the game, in which a player listened to an audio file of some spoken language and then had to pick, from four candidates, the language she had just heard.

I would now like to aggregate all of these games over the "target" and "guess" fields and then count the number of games for each ("target", "guess") pair. Could someone help me with this?

I have looked at the pyspark documentation, but since I am new to Python/pyspark, I don't really understand how the aggregation functions work.

{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
 "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
 "guess": "Maltese", "date": "2013-08-19", "country": "AU"} 

1 Answer:

Answer 0 (score: 0)

Here is a procedure that converts the JSON data into a pyspark DataFrame.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import json

sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)

def convert_single_object_per_line(json_list):
    # Serialize each dict as one JSON object per line (JSON Lines format).
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

json_list = [{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
 "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
 "guess": "Maltese", "date": "2013-08-19", "country": "AU"}]


json_string = convert_single_object_per_line(json_list)

# Parse each line back into a dict and build a DataFrame from the rows.
df = sqlContext.createDataFrame([json.loads(line) for line in json_string.splitlines()])


[In]:df
[Out]:
DataFrame[choices: array<string>, country: string, date: string, guess: string, sample: string, target: string]
[In]:df.show()
[Out]:
+--------------------+-------+----------+-------+--------------------+-------+
|             choices|country|      date|  guess|              sample| target|
+--------------------+-------+----------+-------+--------------------+-------+
|[Hindi, Lao, Malt...|     AU|2013-08-19|Maltese|af0e25c7637fb0dcd...|Turkish|
+--------------------+-------+----------+-------+--------------------+-------+