I am currently working with PySpark and the great language game dataset, which contains a number of samples as JSON objects like the one below.
Each sample represents one round of the game: a person listened to an audio clip of some spoken language and then had to pick, from four possible choices, the language she had just heard.
I would now like to group all of these games by the "target" and "guess" fields and then count the number of games for each ("target", "guess") pair. Could someone help me with this?
I have looked at the PySpark documentation, but since I am new to Python/PySpark, I don't really understand how the aggregation functions work.
{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
"choices": ["Hindi", "Lao", "Maltese", "Turkish"],
"guess": "Maltese", "date": "2013-08-19", "country": "AU"}
Answer 0 (score: 0)
Here is a way to convert the JSON data into a PySpark DataFrame.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import json

sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)

def convert_single_object_per_line(json_list):
    # Serialize each record as one JSON object per line.
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

json_list = [{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
              "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
              "guess": "Maltese", "date": "2013-08-19", "country": "AU"}]

json_string = convert_single_object_per_line(json_list)
df = sqlContext.createDataFrame([json.loads(line) for line in json_string.splitlines()])
[In]:df
[Out]:
DataFrame[choices: array<string>, country: string, date: string, guess: string, sample: string, target: string]
[In]:df.show()
[Out]:
+--------------------+-------+----------+-------+--------------------+-------+
| choices|country| date| guess| sample| target|
+--------------------+-------+----------+-------+--------------------+-------+
|[Hindi, Lao, Malt...| AU|2013-08-19|Maltese|af0e25c7637fb0dcd...|Turkish|
+--------------------+-------+----------+-------+--------------------+-------+