Spark: JSON text field to RDD

Asked: 2015-05-04 15:26:16

Tags: scala cassandra apache-spark rdd

I have a Cassandra table with a text field named snapshot that holds JSON objects:

[identifier, timestamp, snapshot]

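As I understand it, to run Spark transformations on that field I need to turn that field of the RDD into another RDD, so that I can transform the data based on the JSON schema.

Is that correct? How should I go about it?

Edit: so far I have managed to create an RDD from a single text field: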

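import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("signal-aggregation")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)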
val snapshots = sc.cassandraTable[(String, String, String)]("listener", "snapshots")
val first = snapshots.first()
val firstJson = sqlContext.jsonRDD(sc.parallelize(Seq(first._3)))
firstJson.printSchema()

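This prints the JSON schema. Good!

How do I tell Spark to apply this schema to all rows of the snapshots table, so that I get an RDD of the snapshot field from every row?

1 answer:

Answer 0 (score: 13)

You're almost there: just pass an RDD[String] containing your JSON to the jsonRDD method.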

import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("signal-aggregation")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val snapshots = sc.cassandraTable[(String, String, String)]("listener", "snapshots")
val jsons = snapshots.map(_._3) // third tuple element: the snapshot JSON, as an RDD[String]
val jsonSchemaRDD = sqlContext.jsonRDD(jsons) // pass the RDD in directly; the schema is inferred from its rows
jsonSchemaRDD.registerTempTable("testjson")
sqlContext.sql("SELECT * FROM testjson where .... ").collect
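
When the schema is already known, for instance the one printed from the first row in the question, it can be supplied up front so that Spark does not have to infer it from the data. The following is only a minimal sketch under stated assumptions: it targets Spark 2.2 or later, where `SparkSession` and `spark.read.schema(...).json(...)` stand in for `SQLContext` and `jsonRDD`, and the object name `ExplicitSchemaSketch`, the helper `parseWithSchema`, the two-field schema, the in-memory sample rows and the `local[*]` master are all hypothetical stand-ins for the real Cassandra table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ExplicitSchemaSketch {
  // Hypothetical schema; in practice it would describe the real snapshot JSON.
  val snapshotSchema: StructType = StructType(Seq(
    StructField("age", LongType, nullable = true),
    StructField("eyeColor", StringType, nullable = true)
  ))

  // Takes (identifier, timestamp, snapshot) rows and parses the third
  // element of each row with the fixed schema above (no inference).
  def parseWithSchema(spark: SparkSession,
                      rows: Seq[(String, String, String)]): DataFrame = {
    import spark.implicits._
    val jsons = rows.toDS().map(_._3) // Dataset[String] of snapshot JSON
    spark.read.schema(snapshotSchema).json(jsons)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("signal-aggregation")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()

    // Assumption: in-memory rows standing in for the Cassandra table.
    val rows = Seq(
      ("id-1", "t1", """{"age": 35, "eyeColor": "blue"}"""),
      ("id-2", "t2", """{"age": 34, "eyeColor": "blue"}""")
    )

    val df = parseWithSchema(spark, rows)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

Fields that are missing from a given JSON object come back as null, which is why every field in the sketch is declared nullable.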

A simple example:

val stringRDD = sc.parallelize(Seq(""" 
  { "isActive": false,
    "balance": "$1,431.73",
    "picture": "http://placehold.it/32x32",
    "age": 35,
    "eyeColor": "blue"
  }""",
   """{
    "isActive": true,
    "balance": "$2,515.60",
    "picture": "http://placehold.it/32x32",
    "age": 34,
    "eyeColor": "blue"
  }""", 
  """{
    "isActive": false,
    "balance": "$3,765.29",
    "picture": "http://placehold.it/32x32",
    "age": 26,
    "eyeColor": "blue"
  }""")
)
sqlContext.jsonRDD(stringRDD).registerTempTable("testjson")
sqlContext.sql("SELECT age from testjson").collect // query the same context the temp table was registered on
//res24: Array[org.apache.spark.sql.Row] = Array([35], [34], [26])
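
`jsonRDD` has been deprecated since Spark 1.4 in favor of `read.json`, and `registerTempTable` since Spark 2.0 in favor of `createOrReplaceTempView`. For readers on a newer release, the same idea can be sketched roughly as follows. This assumes Spark 2.2 or later; the object name `JsonViewSketch`, the helper `ages`, the shortened sample records and the `local[*]` master are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object JsonViewSketch {
  // Infers a schema from the JSON strings, registers a temporary view,
  // and returns the values of the "age" column.
  def ages(spark: SparkSession, jsonStrings: Seq[String]): Array[Long] = {
    import spark.implicits._
    val df = spark.read.json(jsonStrings.toDS()) // one JSON object per element
    df.createOrReplaceTempView("testjson")
    // JSON integers are inferred as LongType, hence getLong.
    spark.sql("SELECT age FROM testjson").collect().map(_.getLong(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("signal-aggregation")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()

    val sample = Seq(
      """{"isActive": false, "balance": "$1,431.73", "age": 35, "eyeColor": "blue"}""",
      """{"isActive": true, "balance": "$2,515.60", "age": 34, "eyeColor": "blue"}""",
      """{"isActive": false, "balance": "$3,765.29", "age": 26, "eyeColor": "blue"}"""
    )

    println(ages(spark, sample).mkString(", "))
    spark.stop()
  }
}
```

With the Cassandra connector, the `Seq[String]` would be replaced by the snapshot column of the table; how that column is read into a `Dataset[String]` depends on the connector version in use.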