I have a JSON file where each line has the following format:
{"cty":"United Kingdom","gzip":"H4sIAAAAAAAAAKtWystVslJQcs4rLVHSUUouqQTxQvMyS1JTFLwz89JT8nOB4hnFqSBxj/zS4lSF/DQFl9S83MSibKBMZVFxSMbQwNBM19DA2FSpFgDvJUGVUwAAAA==","nm":"Edmund lronside","yrs":"1016"}

The gzip field is itself compressed JSON, and I already have a function that decompresses that field into a string.

I want to read the file and build the complete nested JSON as a single line:

{"cty":"United Kingdom","gzip":{"nm": "Cnut","cty": "United Kingdom","hse": "House of Denmark","yrs": "1016-1035"},"nm":"Edmund lronside","yrs":"1016"}

The problem:

If I build an RDD with the following code:

val jsonData = sqlContext.read.json(sourceFilesPath)

//loop through the DataFrame and manipulate the gzip field
val jsonUnGzip = jsonData.map(r => Row(r.getString(0), GZipHelper.unCompress(r.getString(1)).get, r.getString(2), r.getString(3)))

I get a row with 4 columns (String, String, String, String).
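For context, here is a minimal sketch of what a decompression helper like GZipHelper.unCompress could look like (an assumption; the question's actual implementation is not shown). It treats the field as a Base64-encoded gzip stream, which the "H4sI" prefix of the sample value suggests:

```scala
import java.io.ByteArrayInputStream
import java.util.Base64
import java.util.zip.GZIPInputStream
import scala.io.Source

// Hypothetical stand-in for the helper mentioned in the question.
object GZipHelper {
  def unCompress(b64: String): Option[String] =
    try {
      val bytes = Base64.getDecoder.decode(b64)            // undo the Base64 layer
      val gzipIn = new GZIPInputStream(new ByteArrayInputStream(bytes))
      try Some(Source.fromInputStream(gzipIn, "UTF-8").mkString) // gunzip to a JSON string
      finally gzipIn.close()
    } catch {
      case _: Exception => None                            // malformed input -> None
    }
}
```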
Now, I can't tell Spark to "re-parse" Col(1) as JSON, right?

I have seen posts about using case classes or explode, but I don't understand how that would help here.
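One way this is sometimes handled (a sketch under assumptions, not a tested answer): rebuild each line as a plain JSON string with the decompressed object spliced in, then let sqlContext.read.json infer the nested schema from the resulting RDD[String]. The positions 0-3 assume the alphabetical column order (cty, gzip, nm, yrs) that read.json produced above, and GZipHelper.unCompress is the question's own helper.

```scala
// Sketch, assuming Spark 1.6-style APIs as in the snippet above.
// jsonData.map over the DataFrame yields an RDD[Row]; splice the
// decompressed JSON object back in place of the Base64 string.
val jsonStrings = jsonData.map { r =>
  val nested = GZipHelper.unCompress(r.getString(1)).get
  s"""{"cty":"${r.getString(0)}","gzip":$nested,"nm":"${r.getString(2)}","yrs":"${r.getString(3)}"}"""
}

// read.json over an RDD[String] re-infers the schema, so gzip
// becomes a nested struct column instead of a flat string.
val jsonNested = sqlContext.read.json(jsonStrings)
```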