How to extract complex JSON structures using Apache Spark 1.4.0 DataFrames

Date: 2015-06-24 01:33:34

Tags: apache-spark apache-spark-sql

I am using the new Apache Spark 1.4.0 DataFrames API to extract information from Twitter's status JSON, mostly focused on the Entities object. The part relevant to this question is shown below:

{
  ...
  "entities": {
    "hashtags": [],
    "trends": [],
    "urls": [],
    "user_mentions": [
      {
        "screen_name": "linobocchini",
        "name": "Lino Bocchini",
        "id": 187356243,
        "id_str": "187356243",
        "indices": [ 3, 16 ]
      },
      {
        "screen_name": "jeanwyllys_real",
        "name": "Jean Wyllys",
        "id": 111123176,
        "id_str": "111123176",
        "indices": [ 79, 95 ]
      }
    ],
    "symbols": []
  },
  ...
}

There are several examples of how to extract information from primitive types such as string, integer, etc., but I couldn't find anything on how to process this kind of complex structure.

I tried the code below, but it still doesn't work; it throws an exception:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tweets = sqlContext.read.json("tweets.json")

// this function is just to filter out empty entities.user_mentions[] nodes,
// since some tweets don't contain any mentions
import org.apache.spark.sql.functions.udf
val isEmpty = udf((value: List[Any]) => value.isEmpty)

import org.apache.spark.sql._
import sqlContext.implicits._
case class UserMention(id: Long, idStr: String, indices: Array[Long], name: String, screenName: String)

val mentions = tweets.select("entities.user_mentions").
  filter(!isEmpty($"user_mentions")).
  explode($"user_mentions") {
    case Row(arr: Array[Row]) => arr.map { elem =>
      UserMention(
        elem.getAs[Long]("id"),
        elem.getAs[String]("is_str"),
        elem.getAs[Array[Long]]("indices"),
        elem.getAs[String]("name"),
        elem.getAs[String]("screen_name"))
    }
  }

mentions.first

The exception appears when I call mentions.first:

scala>     mentions.first
15/06/23 22:15:06 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 8)
scala.MatchError: [List([187356243,187356243,List(3, 16),Lino Bocchini,linobocchini], [111123176,111123176,List(79, 95),Jean Wyllys,jeanwyllys_real])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
    at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
    at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55)
    at org.apache.spark.sql.catalyst.expressions.UserDefinedGenerator.eval(generators.scala:81)

What is wrong here? I understand it is related to the types, but I haven't figured it out yet.

As additional context, here is the structure that was mapped automatically:

scala> mentions.printSchema
root
 |-- user_mentions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- id_str: string (nullable = true)
 |    |    |-- indices: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- screen_name: string (nullable = true)

NOTE 1: I know it is possible to solve this using HiveQL, but I would like to use DataFrames, given all the momentum around them:

SELECT explode(entities.user_mentions) as mentions
FROM tweets
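
For reference, that query can be run against the same HiveContext once the DataFrame is registered as a temporary table. A minimal sketch, reusing the tweets value from above (the temp table name "tweets" is arbitrary):

// register the DataFrame so HiveQL can see it, then run the explode query
tweets.registerTempTable("tweets")
val hiveMentions = sqlContext.sql("SELECT explode(entities.user_mentions) as mentions FROM tweets")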

NOTE 2: the UDF val isEmpty = udf((value: List[Any]) => value.isEmpty) is an ugly hack and I'm probably missing something here, but it was the only way I found to avoid an NPE.
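
For completeness, a slightly more defensive variant of that UDF, as a sketch: it declares the argument as Seq[Any] (Spark passes array columns to UDFs as a Seq) and guards against null, which is presumably what caused the NPE on tweets without mentions:

import org.apache.spark.sql.functions.udf
// null-safe version: returns true only for a non-null, non-empty array
val isNonEmpty = udf((value: Seq[Any]) => value != null && value.nonEmpty)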

2 Answers:

Answer 0 (score: 4)

Here is a working solution, with just one small hack.

The main idea is to work around the type problem by declaring a List[String] rather than a List[Row]:

val mentions = tweets.explode("entities.user_mentions", "mention"){m: List[String] => m}

This creates a second column, called "mention", of type "struct":

|            entities|             mention| 
+--------------------+--------------------+ 
|[List(),List(),Li...|[187356243,187356...| 
|[List(),List(),Li...|[111123176,111123...| 

Now do a map() to extract the fields of the mention. The getStruct(1) call gets the value in column 1 of each row:

case class Mention(id: Long, id_str: String, indices: Seq[Int], name: String, screen_name: String)
val mentionsRdd = mentions.map(
  row => 
    {  
      val mention = row.getStruct(1)
      Mention(mention.getLong(0), mention.getString(1), mention.getSeq[Int](2), mention.getString(3), mention.getString(4))
    }
)
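
If you'd rather not depend on field positions, the same mapping can be written with name-based getters — a sketch under the same schema, using the Row.getAs[T](fieldName) accessor that the question's code also relies on:

// same extraction as above, but addressing struct fields by name
val mentionsRddByName = mentions.map { row =>
  val m = row.getStruct(1)
  Mention(
    m.getAs[Long]("id"),
    m.getAs[String]("id_str"),
    m.getAs[Seq[Int]]("indices"),
    m.getAs[String]("name"),
    m.getAs[String]("screen_name"))
}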

Convert the RDD back into a DataFrame:

val mentionsDf = mentionsRdd.toDF()

There you go!

|       id|   id_str|     indices|         name|    screen_name|
+---------+---------+------------+-------------+---------------+
|187356243|187356243| List(3, 16)|Lino Bocchini|   linobocchini|
|111123176|111123176|List(79, 95)|  Jean Wyllys|jeanwyllys_real|
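
From here, ordinary DataFrame operations apply. For example, a quick sanity check (a hypothetical query over the columns shown above, assuming the sqlContext.implicits._ import from the question is still in scope):

mentionsDf.filter($"screen_name" === "linobocchini").show()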

Answer 1 (score: -1)

Try this instead (match on Seq[Row] rather than Array[Row]):

case Row(arr: Seq[Row]) => arr.map { elem =>
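
In context, that change amounts to rewriting the explode block from the question like this — a sketch, untested, which also fixes the "is_str" typo from the original code and converts the Seq that Spark returns for indices into the Array the case class expects:

val mentions = tweets.select("entities.user_mentions").
  filter(!isEmpty($"user_mentions")).
  explode($"user_mentions") {
    // the column arrives as a Seq[Row], not an Array[Row] — hence the MatchError
    case Row(arr: Seq[Row]) => arr.map { elem =>
      UserMention(
        elem.getAs[Long]("id"),
        elem.getAs[String]("id_str"),       // was "is_str" in the question
        elem.getAs[Seq[Long]]("indices").toArray,
        elem.getAs[String]("name"),
        elem.getAs[String]("screen_name"))
    }
  }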