Question

Here is a sample line from my data file:

{"externalUserId":"f850bgv8-c638-4ab2-a68a d79375fa2091","externalUserPw":null,"ipaddr":null,"eventId":0,"userId":1713703316,"applicationId":489167,"eventType":201,"eventData":"{\"apps\":[\"com.happyadda.jalebi\"],\"appType\":2}","device":null,"version":"3.0.0-b1","bundleId":null,"appPlatform":null,"eventDate":"2017-01-22T13:46:30+05:30"}

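I have millions of lines like this. If the whole file were a single JSON document I could use a JSON reader, but how can I process multiple JSON lines in one file and convert them into a table?

How can I convert this data into a SQL table with columns like these: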

 |externalUserId |externalUserPw|ipaddr| eventId  |userId    |.......
 |---------------|--------------|------|----------|----------|.......
 |f850bgv8-..... |null          |null  |0         |1713703316|.......

Answer 1

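You can use Spark's built-in read.json functionality. It looks like a good fit for your case, since each line contains one JSON object.

For example, the following creates a DataFrame from the contents of a JSON file: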

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
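Applied to the data in the question, this could look like the minimal sketch below. The file name `events.jsonl`, the view name `events`, the helper `loadAsView`, and the local master setting are assumptions for illustration, not taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonLinesToTable {
  // Reads a file with one JSON object per line and registers it as a temporary SQL view.
  // `loadAsView` is a hypothetical helper name, not part of Spark.
  def loadAsView(spark: SparkSession, path: String, viewName: String): DataFrame = {
    val df = spark.read.json(path)        // schema is inferred from the JSON keys
    df.createOrReplaceTempView(viewName)  // makes the DataFrame queryable by name in spark.sql
    df
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonLinesToTable")
      .master("local[*]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()

    // Assumption: the path comes from the first argument, else a placeholder file name.
    val path = args.headOption.getOrElse("events.jsonl")
    loadAsView(spark, path, "events").printSchema()

    // Each top-level JSON key becomes a column of the view.
    spark.sql("SELECT externalUserId, eventId, userId, eventType, eventDate FROM events")
      .show(truncate = false)

    spark.stop()
  }
}
```

A temporary view only lives as long as the SparkSession, so writing to a permanent table would need an additional write step.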

More information: http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources

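Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either an RDD of String or a JSON file.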
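The RDD-of-String variant can be sketched as follows. The object name, the helper `parse`, and the second sample record are made up for illustration; the first record is a shortened version of the line in the question. In Spark 2.2 and later the `RDD[String]` overload is deprecated in favour of one that takes a `Dataset[String]`, so expect a deprecation warning there.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonFromStrings {
  // Parses JSON strings that are already in memory; each string must be one complete JSON object.
  // `parse` is a hypothetical helper name, not part of Spark.
  def parse(spark: SparkSession, jsonLines: Seq[String]): DataFrame = {
    val rdd = spark.sparkContext.parallelize(jsonLines) // RDD[String]
    spark.read.json(rdd)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonFromStrings")
      .master("local[*]") // assumption: local run
      .getOrCreate()

    val sample = Seq(
      // Shortened from the record in the question.
      """{"eventId":0,"userId":1713703316,"eventType":201,"version":"3.0.0-b1"}""",
      // Hypothetical second record, only here to show that several lines become several rows.
      """{"eventId":1,"userId":42,"eventType":202,"version":"3.0.0-b1"}"""
    )

    val df = parse(spark, sample)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```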

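Note that the file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained, valid JSON object. For more information, see the JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.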
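For completeness, if an input really is a regular multi-line JSON document, newer Spark releases offer a `multiLine` read option. The sketch below assumes Spark 2.2 or later (the 2.1.0 release linked above does not have this option); the object name, the helper `readMultiLine`, and the file name `pretty.json` are hypothetical. The data in the question is already one object per line and does not need this.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiLineJson {
  // Reads a JSON file whose records may span several lines.
  // Assumption: Spark 2.2+, where the "multiLine" option is available.
  // `readMultiLine` is a hypothetical helper name, not part of Spark.
  def readMultiLine(spark: SparkSession, path: String): DataFrame =
    spark.read.option("multiLine", true).json(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MultiLineJson")
      .master("local[*]") // assumption: local run
      .getOrCreate()

    // Assumption: the path comes from the first argument, else a placeholder file name.
    val path = args.headOption.getOrElse("pretty.json")
    readMultiLine(spark, path).show(truncate = false)

    spark.stop()
  }
}
```

With `multiLine` enabled each file is parsed as a whole rather than split by line, so for millions of records the line-delimited layout is usually the better fit.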

scala - Convert each JSON line into a table
