I have an input file that looks like CSV:
"2017-06-01T01:01:01Z";"{\"name\":\"aaa\",\"properties\":{"\"propA\":\"some value\",\"propB\":\"other value\"}}"
"2017-06-01T01:01:01Z";"{\"name\":\"bbb\",\"properties\":{"\"propB\":\"some value\","\"propC\":\"some value\",\"propD\":\"other value\"}}"
I want to get a JSON string like the following, so that I can create a dataframe from a plain JSON string:
[{
  "createdTime": "...",
  "value": {
    "name": "...",
    "properties": {
      "propA": "...",
      "propB": "..."
    }
  }
}, {
  "createdTime": "...",
  "value": {
    "name": "...",
    "properties": {
      "propB": "...",
      "propC": "...",
      "propD": "..."
    }
  }
}]
This is semi-structured data: some rows may have propA while other rows may not.
How can I do this in Spark with Scala?
Answer (score: 0)
From what I understand of your question, you want to create a dataframe from this CSV-like file. If I've got that right, here's what you can do. Note that the timestamp in the first field already carries its surrounding quotes, so it can be concatenated directly as a JSON string:
val data = sc.textFile("path to your csv-like file")
// split each line into (timestamp, json) on ";" and repair the quoting
val jsonrdd = data.map(line => line.split(";"))
  .map(array => "{\"createdTime\":" + array(0) + ",\"value\":" +
    array(1).replace(",\"", ",")   // drop stray quotes after commas
      .replace("\\\"", "\"")       // unescape \" to "
      .replace("\"{", "{").replace("{\"\"", "{\"").replace("}\"", "}") + "}")
val df = sqlContext.read.json(jsonrdd)
df.show(false)
You should get a dataframe like this:
+--------------------+----------------------------------------------+
|createdTime |value |
+--------------------+----------------------------------------------+
|2017-06-01T01:01:01Z|[aaa,[some value,other value,null,null]] |
|2017-06-01T01:01:01Z|[bbb,[null,some value,some value,other value]]|
+--------------------+----------------------------------------------+
The schema of the dataframe above will be:
root
|-- createdTime: string (nullable = true)
|-- value: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- properties: struct (nullable = true)
| | |-- propA: string (nullable = true)
| | |-- propB: string (nullable = true)
| | |-- propC: string (nullable = true)
| | |-- propD: string (nullable = true)
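As a sanity check, the replace chain can be mirrored in plain Python to confirm that the repaired strings parse as valid JSON. The `repair` helper below is only an illustration of the string transformation, not part of the Spark job; the two sample lines are the ones from the question:

```python
import json

# Mirror of the Scala replace chain: split on ";" and repair the quoting.
def repair(line):
    ts, payload = line.split(";")
    payload = (payload
               .replace(',"', ",")    # drop stray quotes after commas
               .replace('\\"', '"')   # unescape \" to "
               .replace('"{', "{")    # unwrap the quoted outer object
               .replace('{""', '{"')  # fix the doubled quote after {
               .replace('}"', "}"))   # drop the quote after the closing brace
    return '{"createdTime":' + ts + ',"value":' + payload + "}"

lines = [
    '"2017-06-01T01:01:01Z";"{\\"name\\":\\"aaa\\",\\"properties\\":{"\\"propA\\":\\"some value\\",\\"propB\\":\\"other value\\"}}"',
    '"2017-06-01T01:01:01Z";"{\\"name\\":\\"bbb\\",\\"properties\\":{"\\"propB\\":\\"some value\\","\\"propC\\":\\"some value\\",\\"propD\\":\\"other value\\"}}"',
]
for line in lines:
    record = json.loads(repair(line))  # raises if the repair left invalid JSON
    print(record["value"]["name"], sorted(record["value"]["properties"]))
```

Both sample lines round-trip through `json.loads`, with the second record exposing `propB`, `propC`, and `propD`; Spark merges the differing property sets into one schema with nulls where a property is absent.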