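I have CSV data. I first want to convert it to JSON, and then convert that to a pair RDD.
I am able to do both, but I am not sure whether this is an efficient way to do it, and the keys do not come out in the expected format.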
val df = // somehow read the CSV data
val dataset = df.toJSON //This gives the expected json.
val pairRDD = dataset.rdd.map(record => (JSON.parseFull(record).get.asInstanceOf[Map[String, String]].get("hashKey"), record))
Suppose my schema is:
root
|-- hashKey: string (nullable = true)
|-- sortKey: string (nullable = true)
|-- score: number (nullable = true)
|-- payload: string (nullable = true)
In JSON:
{
"hashKey" : "h1",
"sortKey" : "s1",
"score" : 1.0,
"payload" : "data"
}
{
"hashKey" : "h2",
"sortKey" : "s2",
"score" : 1.0,
"payload" : "data"
}
The expected result is:
[h1, {"hashKey" : "h1", "sortKey" : "s1", "score" : 1.0, "payload" : "data"}]
[h2, {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]
The actual result I am getting is:
[Some(h1), {"hashKey" : "h1", "sortKey" : "s1", "score" : 1.0, "payload" : "data"}]
[Some(h2), {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]
How can I fix this?
Answer 0 (score: 1)
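This happens because of get("hashKey"): Map.get returns an Option, so the key comes out as Some(...) rather than the bare value. Change it to getOrElse("hashKey", "{defaultKey}"), where the default key can be "" or a previously declared constant.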
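To illustrate the difference without Spark, here is a minimal sketch in plain Scala. The object name OptionKeyDemo and the hand-built map are assumptions made for the example; the map only stands in for one parsed JSON record.

```scala
object OptionKeyDemo {
  // Hypothetical stand-in for one parsed JSON record (not from the original data pipeline)
  val parsed: Map[String, String] =
    Map("hashKey" -> "h1", "sortKey" -> "s1", "payload" -> "data")

  // Map.get wraps the value in an Option, which is where Some(...) comes from
  val wrapped: Option[String] = parsed.get("hashKey")

  // getOrElse unwraps the value and falls back to the default when the key is absent
  val key: String = parsed.getOrElse("hashKey", "")
  val missing: String = parsed.getOrElse("noSuchKey", "")

  def main(args: Array[String]): Unit = {
    println(wrapped) // prints Some(h1)
    println(key)     // prints h1
  }
}
```

Using "" as the default means records without a hashKey all end up under the same empty key, so they usually need to be filtered out afterwards.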
Update: a safer Scala approach (instead of using asInstanceOf).
It is better to change your JSON parsing to this:
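dataset.rdd.map { record =>
  JSON.parseFull(record) match {
    // Values are typed Any because "score" is parsed as a Double, not a String
    case Some(json: Map[String, Any] @unchecked) => (json.getOrElse("hashKey", "").toString, record)
    case _ => ("", "")
  }
}.filter { case (key, record) => key != "" && record != "" }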