Question

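I have a Dataset<Row> in Java. I need to read the value of one column (a JSON string), parse it, and then set the values of several other columns based on the parsed JSON.

My dataset looks like this: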

|json                     | name|  age |
======================================== 
| "{'a':'john', 'b': 23}" | null| null |
----------------------------------------
| "{'a':'joe', 'b': 25}"  | null| null |
----------------------------------------
| "{'a':'zack'}"          | null| null |
----------------------------------------

I need it to look like this:

|json                     | name   | age  |
========================================
| "{'a':'john', 'b': 23}" | 'john' | 23   |
----------------------------------------
| "{'a':'joe', 'b': 25}"  | 'joe'  | 25   |
----------------------------------------
| "{'a':'zack'}"          | 'zack' | null |
----------------------------------------

I can't find a way to do this. Please help with code.

Answer 1

Spark has a function called get_json_object. Assuming you have a DataFrame named df, you can solve the problem like this:

df.selectExpr("json", "get_json_object(json, '$.a') as name", "get_json_object(json, '$.b') as age")

But first, make sure your JSON attributes use double quotes rather than single quotes.
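As a rough illustration of the quoting fix mentioned above, here is a minimal plain-Java sketch that rewrites single quotes as double quotes. The class name JsonQuotes and the method toDoubleQuoted are hypothetical, and the approach assumes that no value contains an apostrophe or an embedded double quote, which holds for the sample rows in the question but not for JSON in general.

```java
// Hypothetical helper, not part of Spark: a naive quote normalizer.
final class JsonQuotes {
    private JsonQuotes() {}

    // Replaces every single quote with a double quote.
    // Assumption: values never contain apostrophes or double quotes,
    // otherwise the result would not be valid JSON.
    static String toDoubleQuoted(String json) {
        if (json == null) {
            return null; // keep null rows as null
        }
        return json.replace('\'', '"');
    }
}
```

Calling JsonQuotes.toDoubleQuoted on the first sample row would turn {'a':'john', 'b': 23} into the same object with double-quoted keys and values. Inside Spark, the built-in regexp_replace SQL function applied to the json column may be a simpler way to get the same effect without leaving the query.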

Note: there is a full list of Spark SQL functions. I use it heavily. Consider bookmarking it and referring to it from time to time.

Answer 2

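You can use a UDF:

import org.apache.spark.sql.functions.udf

// Extract the "a" field from the JSON string.
def parseName(json: String): String = ??? // parse json
val parseNameUDF = udf[String, String](parseName)

// Extract the "b" field; Option[Int] lets a missing "b" become null.
def parseAge(json: String): Option[Int] = ??? // parse json
val parseAgeUDF = udf[Option[Int], String](parseAge)

dataFrame
  .withColumn("name", parseNameUDF(dataFrame("json")))
  .withColumn("age", parseAgeUDF(dataFrame("json")))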
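The parseName and parseAge bodies are left as placeholders in the answer above. Since the question is about Java, here is one possible plain-Java sketch of that parsing logic, written without Spark so it can be tried in isolation. The class name JsonFieldSketch is hypothetical, the keys a and b come from the question's sample data, and the regex approach is an assumption that only suits flat objects like those shown; a real JSON library would be the safer choice for anything more complex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex-based field extraction for flat JSON objects only.
final class JsonFieldSketch {
    // Matches "a": "value" or 'a': 'value' and captures the value.
    private static final Pattern NAME =
        Pattern.compile("[\"']a[\"']\\s*:\\s*[\"']([^\"']*)[\"']");
    // Matches "b": 23 or 'b': 23 and captures the integer.
    private static final Pattern AGE =
        Pattern.compile("[\"']b[\"']\\s*:\\s*(-?\\d+)");

    private JsonFieldSketch() {}

    // Returns the value of key "a", or null when the key is absent.
    static String parseName(String json) {
        if (json == null) {
            return null;
        }
        Matcher m = NAME.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    // Returns the value of key "b" as a boxed Integer, or null when absent.
    // A boxed type is used so that a missing key can map to a null column value.
    static Integer parseAge(String json) {
        if (json == null) {
            return null;
        }
        Matcher m = AGE.matcher(json);
        return m.find() ? Integer.valueOf(m.group(1)) : null;
    }
}
```

The nullable Integer return type matters for the third sample row, which has no b key and is expected to end up with a null age.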
