Getting keys and values from the rows of an RDD of stringified JSON

Time: 2018-12-11 13:08:50

Tags: json scala apache-spark rdd

Each row of my RDD looks like this:

[{"date":1.533204038E12,"time":1.533204038E12,"num":"KD10617029","type":"item","vat":0}]

My function:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}

def writeToES(data: java.util.List[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName("ESWriter").setMaster("local")
    val sc: SparkContext = new SparkContext(conf)
    val sql: SQLContext = new SQLContext(sc)
    val spark: SparkSession = sql.sparkSession
    sc.setLogLevel("ERROR")
    import spark.implicits._

    val dataList = data.toArray()
    //println("datalist size: " + dataList.size)

    // Splits each raw string on commas and glues the pieces back together,
    // so the JSON keys and values are lost along the way.
    val dataDF = sc.parallelize(dataList)
              .map(x => x.toString)
              .map(x => x.split(","))
              .map(x => Row.fromSeq(x))
              .map(x => x.mkString(",")).toDF()

    dataDF.show()
    dataDF.take(1).toList.foreach(println)
    println(dataDF.take(1).length)
}

How can I get the keys out of the stringified JSON in the list, and how can I get each JSON document's values as rows of an RDD (or DataFrame)?

1 Answer:

Answer 0 (score: 1)

As @user238607 suggested, you can convert the strings directly. But you can also work with the intermediate RDD (holding the JSON strings) directly:

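A minimal sketch of this approach, assuming a SparkSession named spark and an intermediate RDD of JSON strings named jsonRdd (both names are illustrative, not from the original answer):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    val spark: SparkSession = SparkSession.builder().appName("ESWriter").master("local").getOrCreate()

    // Intermediate RDD with one stringified JSON document per element.
    // Spark's JSON reader treats each element of a top-level array as its own record.
    val jsonRdd: RDD[String] = spark.sparkContext.parallelize(Seq(
      """[{"date":1.533204038E12,"time":1.533204038E12,"num":"KD10617029","type":"item","vat":0}]"""
    ))

    // Let Spark infer the schema: the JSON keys become column names,
    // the JSON values become the row contents.
    val dataDF = spark.read.json(jsonRdd)

    dataDF.printSchema()   // columns: date, num, time, type, vat
    dataDF.show()          // one row per JSON document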

This creates a DataFrame directly from the intermediate RDD.


For Spark >= 2.2.0, pass a Dataset to the json() function instead of an RDD (the RDD overload is deprecated).
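A minimal sketch of the Dataset-based variant, assuming the same spark and jsonRdd names as above:

    import spark.implicits._

    // Spark >= 2.2.0: wrap the RDD of JSON strings in a Dataset[String]
    // and pass that to json(); the RDD[String] overload is deprecated.
    val jsonDS = spark.createDataset(jsonRdd)
    val dataDF = spark.read.json(jsonDS)

    // Column names are the JSON keys; each document is one row.
    dataDF.select("num", "type", "vat").show()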