I am currently working with PySpark and the great language game dataset, which contains a number of samples as JSON objects like the one below.
Each sample represents one round of the game: a person listened to an audio clip of some spoken language and then had to pick, from four possible choices, the language she had just heard.
I would now like to group all of these games by the "target" and "guess" fields and then count the number of games for each ("target", "guess") pair. Could someone help me with this?
I have looked at the PySpark documentation, but since I am new to Python/PySpark, I don't really understand how the aggregation functions work.
{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
"choices": ["Hindi", "Lao", "Maltese", "Turkish"],
"guess": "Maltese", "date": "2013-08-19", "country": "AU"}
Answer 0 (score: 0)
Here is a way to convert the JSON data into a PySpark DataFrame.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import json

sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)

def convert_single_object_per_line(json_list):
    # Serialize each record as one JSON object per line.
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

json_list = [{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
              "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
              "guess": "Maltese", "date": "2013-08-19", "country": "AU"}]

json_string = convert_single_object_per_line(json_list)
df = sqlContext.createDataFrame([json.loads(line) for line in json_string.splitlines()])
[In]:df
[Out]:
DataFrame[choices: array<string>, country: string, date: string, guess: string, sample: string, target: string]
[In]:df.show()
[Out]:
+--------------------+-------+----------+-------+--------------------+-------+
| choices|country| date| guess| sample| target|
+--------------------+-------+----------+-------+--------------------+-------+
|[Hindi, Lao, Malt...| AU|2013-08-19|Maltese|af0e25c7637fb0dcd...|Turkish|
+--------------------+-------+----------+-------+--------------------+-------+