How do I convert a WrappedArray to a Dataset in Spark Scala?

Asked: 2017-10-30 22:08:36

Tags: scala apache-spark

[image: nested JSON structure from the file]

Hi, I am new to Spark Scala. I have this structure in a json file, which I need to convert to a dataset. Because of the nested data I am unable to do it.

I tried something like the following, which I picked up from some posts, but it doesn't work. Can someone suggest a solution?

  spark.read.json(path).map(r=>r.getAs[mutable.WrappedArray[String]]("readings"))

2 Answers:

Answer 0 (score: 2)

Your JSON format is not valid for conversion to a dataframe. The json information that is to be converted to a dataframe/dataset should be in one line.

So the first step you have to do is to read the json file and convert it into valid one-line json format. You can use the wholeTextFiles api and some replacements.

val rdd = sc.wholeTextFiles("path to your json text file")
val validJson = rdd.map(_._2.replace(" ", "").replace("\n", ""))

The second step is to convert the valid json data into a dataframe or dataset. I am using a dataframe here.

val dataFrame = sqlContext.read.json(validJson)

which should give you

+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|did                             |readings                                                                                                                                                                 |
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|d7cc92c24be32d5d419af1277289313c|[[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++]),1506770544]]|
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

root
 |-- did: string (nullable = true)
 |-- readings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- clients: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- cid: string (nullable = true)
 |    |    |    |    |-- clientOS: string (nullable = true)
 |    |    |    |    |-- rssi: long (nullable = true)
 |    |    |    |    |-- snRatio: long (nullable = true)
 |    |    |    |    |-- ssid: string (nullable = true)
 |    |    |-- ts: long (nullable = true)

Now selecting the clients WrappedArray is as easy as

dataFrame.select("readings.clients")

which should give you

+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|clients                                                                                                                                                     |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++])]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------+

I hope the answer is helpful.

Updated:

Dataframes and datasets are almost the same, except that datasets are type safe, using encoders, and datasets are optimized over dataframes.

Long story short, you can change a dataframe to a dataset by creating case classes. For your case you need three case classes.

case class client(cid: String, clientOS: String, rssi: Long, snRatio: Long, ssid: String)
case class reading(clients: Array[client], ts: Long)
case class dataset(did: String, readings: Array[reading])

And then convert the dataframe to a dataset as

val dataSet = sqlContext.read.json(validJson).as[dataset]

You should have a dataset in your hands. :)
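Spark aside, it is worth sanity-checking that the three case classes nest the same way as the schema printed earlier. The sketch below restates them and builds one hypothetical value shaped like the row shown above (the sample values are made up); in a real job you would also need the encoder implicits (e.g. `import sqlContext.implicits._`) in scope before calling `.as[dataset]`:

```scala
// The three case classes from the answer, restated verbatim.
case class client(cid: String, clientOS: String, rssi: Long, snRatio: Long, ssid: String)
case class reading(clients: Array[client], ts: Long)
case class dataset(did: String, readings: Array[reading])

// Hypothetical sample values, mirroring the row displayed earlier.
val c = client("aa1111111111111111c1111111111111112222222222e", "AppleiOS", -46L, 49L, "ITU++")
val r = reading(Array(c), 1506770544L)
val d = dataset("d7cc92c24be32d5d419af1277289313c", Array(r))

// Once .as[dataset] succeeds, field access is typed like this,
// replacing getAs/WrappedArray casts.
println(d.readings.head.clients.head.clientOS) // prints "AppleiOS"
```

The point of the typed version is exactly this last line: nested fields come back as `String`/`Long`, with no runtime casting.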

Answer 1 (score: 1)

You can't create a DataSet with the code below:

spark.read.json(path).map(r => r.getAs[WrappedArray[String]]("readings"))

Check the schema of the DF that is created when the JSON is read, and note the type of the clients field.

spark.read.json(path).printSchema

root
 |-- did: string (nullable = true)
 |-- readings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- clients: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- cid: string (nullable = true)
 |    |    |    |    |-- clientOS: string (nullable = true)
 |    |    |    |    |-- rssi: long (nullable = true)
 |    |    |    |    |-- snRatio: long (nullable = true)
 |    |    |    |    |-- ssid: string (nullable = true)
 |    |    |-- ts: long (nullable = true)

You can get the scala.collection.mutable.WrappedArray object with the code below:

spark.read.json(path).first.getAs[WrappedArray[(String,String,Long,Long,String)]]("readings")
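As a side note, WrappedArray is not Spark-specific: it is the Seq wrapper that Scala 2.x puts around an Array (on Scala 2.13+ the wrapper class is named ArraySeq instead), so ordinary collection operations apply to the value getAs hands back. A minimal Spark-free sketch:

```scala
// Assigning an Array to a Seq wraps it implicitly
// (the wrapper is WrappedArray on the Scala 2.11/2.12 that Spark 2.x uses).
val wrapped: Seq[String] = Array("AppleiOS", "AppleiOS")

// Ordinary Seq operations work on the wrapper...
val distinctOs = wrapped.distinct

// ...and it converts back to a concrete collection when needed.
val asList = wrapped.toList
println(distinctOs.mkString(",")) // prints "AppleiOS"
```

So once the WrappedArray is extracted, no special Spark API is needed to process it.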

If you need to create a dataframe, use the following:

spark.read.json(path).select("readings.clients")