I have an input file that looks like CSV:
"2017-06-01T01:01:01Z";"{\"name\":\"aaa\",\"properties\":{"\"propA\":\"some value\",\"propB\":\"other value\"}}"
"2017-06-01T01:01:01Z";"{\"name\":\"bbb\",\"properties\":{"\"propB\":\"some value\","\"propC\":\"some value\",\"propD\":\"other value\"}}"
I want to get a JSON string like the following, so that I can create a dataframe from a plain JSON string:
[{
  "createdTime": "...",
  "value": {
    "name": "...",
    "properties": {
      "propA": "...",
      "propB": "..."
    }
  }
}, {
  "createdTime": "...",
  "value": {
    "name": "...",
    "properties": {
      "propB": "...",
      "propC": "...",
      "propD": "..."
    }
  }
}]
This is semi-structured data: some rows may have propA while other rows may not.
How can I do this in Spark with Scala?
Answer (score: 0)
From what I understand of your question, you want to create a dataframe from this CSV-like file. If I've got that right, here's what you can do. Note that the timestamp in the first field already carries its surrounding quotes, so it can be concatenated directly as a JSON string:
val data = sc.textFile("path to your csv-like file")
// split each line into (timestamp, json) on ";" and repair the quoting
val jsonrdd = data.map(line => line.split(";"))
  .map(array => "{\"createdTime\":" + array(0) + ",\"value\":" +
    array(1).replace(",\"", ",")   // drop stray quotes after commas
      .replace("\\\"", "\"")       // unescape \" to "
      .replace("\"{", "{").replace("{\"\"", "{\"").replace("}\"", "}") + "}")
val df = sqlContext.read.json(jsonrdd)
df.show(false)
You should get a dataframe like this:
+--------------------+----------------------------------------------+
|createdTime |value |
+--------------------+----------------------------------------------+
|2017-06-01T01:01:01Z|[aaa,[some value,other value,null,null]] |
|2017-06-01T01:01:01Z|[bbb,[null,some value,some value,other value]]|
+--------------------+----------------------------------------------+
The schema of the dataframe above will be:
root
|-- createdTime: string (nullable = true)
|-- value: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- properties: struct (nullable = true)
| | |-- propA: string (nullable = true)
| | |-- propB: string (nullable = true)
| | |-- propC: string (nullable = true)
| | |-- propD: string (nullable = true)
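As a sanity check, the replace chain can be mirrored in plain Python to confirm that the repaired strings parse as valid JSON. The `repair` helper below is only an illustration of the string transformation, not part of the Spark job; the two sample lines are the ones from the question:

```python
import json

# Mirror of the Scala replace chain: split on ";" and repair the quoting.
def repair(line):
    ts, payload = line.split(";")
    payload = (payload
               .replace(',"', ",")    # drop stray quotes after commas
               .replace('\\"', '"')   # unescape \" to "
               .replace('"{', "{")    # unwrap the quoted outer object
               .replace('{""', '{"')  # fix the doubled quote after {
               .replace('}"', "}"))   # drop the quote after the closing brace
    return '{"createdTime":' + ts + ',"value":' + payload + "}"

lines = [
    '"2017-06-01T01:01:01Z";"{\\"name\\":\\"aaa\\",\\"properties\\":{"\\"propA\\":\\"some value\\",\\"propB\\":\\"other value\\"}}"',
    '"2017-06-01T01:01:01Z";"{\\"name\\":\\"bbb\\",\\"properties\\":{"\\"propB\\":\\"some value\\","\\"propC\\":\\"some value\\",\\"propD\\":\\"other value\\"}}"',
]
for line in lines:
    record = json.loads(repair(line))  # raises if the repair left invalid JSON
    print(record["value"]["name"], sorted(record["value"]["properties"]))
```

Both sample lines round-trip through `json.loads`, with the second record exposing `propB`, `propC`, and `propD`; Spark merges the differing property sets into one schema with nulls where a property is absent.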