How to extract zipped JSON within JSON using Spark?

Date: 2015-12-24 09:48:33

Tags: json apache-spark apache-spark-sql

I have a JSON file whose lines have the following format:

{"cty":"United Kingdom","gzip":"H4sIAAAAAAAAAKtWystVslJQcs4rLVHSUUouqQTxQvMyS1JTFLwz89JT8nOB4hnFqSBxj/zS4lSF/DQFl9S83MSibKBMZVExSMbQwNBM19DA2FSpFgDvJUGVUwAAAA==","nm":"Edmund lronside","yrs":"1016"}

The gzip field is itself a compressed (gzipped) JSON document.

I want to read the file and build the full nested JSON as a single row:

{"cty":"United Kingdom","gzip":"H4sIAAAAAAAAAKtWystVslJQcs4rLVHSUUouqQTxQvMyS1JTFLwz89JT8nOB4hnFqSBxj/zS4lSF/DQFl9S83MSibKBMZVExSMbQwNBM19DA2FSpFgDvJUGVUwAAAA==","nm":"Edmund lronside","yrs":"1016"}

I already have a function that extracts the compressed field as a string.
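
The helper itself is not shown here; a minimal sketch of such a function, assuming the field holds Base64-encoded gzip bytes (the H4sI prefix of the value suggests exactly that), could look like this:

import java.io.ByteArrayInputStream
import java.util.Base64
import java.util.zip.GZIPInputStream
import scala.io.Source
import scala.util.Try

object GZipHelper {
  // Base64-decode the field, then gunzip the bytes back into a JSON string.
  def unCompress(compressed: String): Try[String] = Try {
    val bytes = Base64.getDecoder.decode(compressed)
    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  }
}

Returning a Try matches the unCompress(...).get call used in the code below.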

The problem:

If I build the RDD with the following code:

import org.apache.spark.sql.Row

val jsonData = sqlContext.read.json(sourceFilesPath)

// map over the DataFrame and decompress the gzip field (column 1)
val jsonUnGzip = jsonData.map(r => Row(r.getString(0), GZipHelper.unCompress(r.getString(1)).get, r.getString(2), r.getString(3)))

I get rows with 4 columns (String, String, String, String).

Now, I can't tell Spark to "re-parse" Col(1) as JSON, right?
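
One workaround, sketched here under the assumption that the decompressed payload is valid JSON and that the other string fields need no escaping: rebuild each line as a complete JSON string with the payload spliced in, then feed the resulting RDD[String] back through sqlContext.read.json so Spark re-infers the nested schema (read.json returns columns in alphabetical order, cty/gzip/nm/yrs, which is what the positional getString calls above rely on):

// Sketch: splice the decompressed JSON back into each line, then re-parse.
val fullJsonLines = jsonData.map { r =>
  val nested = GZipHelper.unCompress(r.getString(1)).get
  s"""{"cty":"${r.getString(0)}","gzip":$nested,"nm":"${r.getString(2)}","yrs":"${r.getString(3)}"}"""
}
val nestedDF = sqlContext.read.json(fullJsonLines)
nestedDF.printSchema() // gzip should now appear as a nested struct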

I've seen some posts about using case classes or explode, but I don't understand how they would help here.
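
For what it's worth, the case-class suggestion usually means parsing the decompressed string into a typed object with a JSON library such as json4s (bundled with Spark), then building a DataFrame from that. A sketch, with a hypothetical Monarch case class matching the payload's fields:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical case class mirroring the decompressed payload.
case class Monarch(nm: String, cty: String, hse: String, yrs: String)

val monarchs = jsonData.map { r =>
  // formats are defined inside the closure to avoid serialization issues
  implicit val formats = DefaultFormats
  parse(GZipHelper.unCompress(r.getString(1)).get).extract[Monarch]
}
// with import sqlContext.implicits._ in scope, monarchs.toDF() gives a typed DataFrame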

0 answers:

There are no answers yet.