Question

我有CSV数据。我首先要将其转换为Json，然后再将其转换为Pair RDD。

我能够做这两种事情，但是我不确定这样做是否有效，而且它们的密钥不是预期的格式。


    val df = //some how read the csv data
    val dataset = df.toJSON //This gives the expected json.
    val pairRDD = dataset.rdd.map(record => (JSON.parseFull(record).get.asInstanceOf[Map[String, String]].get("hashKey"), record))

假设我的模式是


    root
     |-- hashKey: string (nullable = true)
     |-- sortKey: string (nullable = true)
     |-- score: number (nullable = true)
     |-- payload: string (nullable = true)


    In json
    {
    "hashKey" : "h1",
    "sortKey" : "s1",
    "score" : 1.0,
    "payload" : "data"
    }
    {
    "hashKey" : "h2",
    "sortKey" : "s2",
    "score" : 1.0,
    "payload" : "data"
    }

    EXPECTED result should be
    [1, {"hashKey" : "1", "sortKey" : "2", "score" : 1.0, "payload" : "data"} ]
    [2, {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]


    ACTUAL result I am getting
    [**Some(1)**, {"hashKey" : "1", "sortKey" : "2", "score" : 1.0, "payload" : "data"} ]
    [**Some(2)**, {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]

我可以解决这个问题吗？

Answer 1

是因为get("hashKey")。将其更改为getOrElse("hashKey","{defaultKey}")-当您的默认密钥可以为""或之前声明的常量时。

更新为更安全的Scala方法（而不是使用instance of）

最好将您的json解析更改为此：

dataset.rdd.map(record => JSON.parseFull(record).map{
    case json: Map[String, String] => (json.getOrElse("hashKey",""), record)
    case _ => ("", "")
}.filter{ case (key, record) => key != "" && record != "") }

将CSV转换为JSON以在Scala Spark中配对RDD

1 个答案: