Splitting the contents of a DataFrame column in Spark 1.4 for nested JSON data

Date: 2015-07-01 20:09:33

Tags: scala apache-spark

I am having problems splitting the contents of a DataFrame column using Spark 1.4. The DataFrame was created by reading a nested, complex JSON file. I used df.explode but keep getting an error message. The JSON file is in the following format:

[   
    {   
        "neid":{  }, 
        "mi":{   
            "mts":"20100609071500Z", 
            "gp":"900", 
            "tMOID":"Aal2Ap", 
            "mt":[  ], 
            "mv":[   
                {
                    "moid":"ManagedElement=1,TransportNetwork=1,Aal2Sp=1,Aal2Ap=r1552q",
                    "r":[ 1, 2, 5 ]
                },
                {
                    "moid":"ManagedElement=1,TransportNetwork=1,Aal2Sp=1,Aal2Ap=r1542q",
                    "r":[ 1, 2, 5 ]
                }
            ] 
        } 
    }, 
    {   
        "neid":{   
            "neun":"RC003", 
            "nedn":"SubNetwork=ONRM_RootMo_R,SubNetwork=RC003,MeContext=RC003", 
            "nesw":"CP90831_R9YC/11" 
        }, 
        "mi":{   
            "mts":"20100609071500Z", 
            "gp":"900", 
            "tMOID":"PlugInUnit", 
            "mt":"pmProcessorLoad", 
            "mv":[   
                {
                    "moid":"ManagedElement=1,Equipment=1,Subrack=MS,Slot=6,PlugInUnit=1",
                    "r":[ 1, 2, 5 ]
                },
                {
                    "moid":"ManagedElement=1,Equipment=1,Subrack=ES-1,Slot=1,PlugInUnit=1",
                    "r":[ 1, 2, 5 ]
                }
            ] 
        } 
    } 
]

I load it in Spark 1.4 with the following code:

scala> val df = sqlContext.read.json("/Users/xx/target/statsfile.json") 

scala> df.show() 
+--------------------+--------------------+ 
|                  mi|                neid| 
+--------------------+--------------------+ 
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...| 
|[900,["pmIcmpInEr...|[SubNetwork=ONRM_...| 
|[900,pmUnsuccessf...|[SubNetwork=ONRM_...| 
|[900,["pmBwErrBlo...|[SubNetwork=ONRM_...| 
|[900,["pmSctpStat...|[SubNetwork=ONRM_...| 
|[900,["pmLinkInSe...|[SubNetwork=ONRM_...| 
|[900,["pmGrFc","p...|[SubNetwork=ONRM_...| 
|[900,["pmReceived...|[SubNetwork=ONRM_...| 
|[900,["pmIvIma","...|[SubNetwork=ONRM_...| 
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...| 
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...| 
|[900,["pmExisOrig...|[SubNetwork=ONRM_...| 
|[900,["pmHDelayVa...|[SubNetwork=ONRM_...| 
|[900,["pmReceived...|[SubNetwork=ONRM_...| 
|[900,["pmReceived...|[SubNetwork=ONRM_...| 
|[900,["pmAverageR...|[SubNetwork=ONRM_...| 
|[900,["pmDchFrame...|[SubNetwork=ONRM_...| 
|[900,["pmReceived...|[SubNetwork=ONRM_...| 
|[900,["pmNegative...|[SubNetwork=ONRM_...| 
|[900,["pmUsedTbsQ...|[SubNetwork=ONRM_...| 
+--------------------+--------------------+ 
scala> df.printSchema() 
root 
 |-- mi: struct (nullable = true) 
 |    |-- gp: long (nullable = true) 
 |    |-- mt: string (nullable = true) 
 |    |-- mts: string (nullable = true) 
 |    |-- mv: string (nullable = true) 
 |-- neid: struct (nullable = true) 
 |    |-- nedn: string (nullable = true) 
 |    |-- nesw: string (nullable = true) 
 |    |-- neun: string (nullable = true) 

scala> val df1=df.select("mi.mv").show() 
+--------------------+ 
|                  mv| 
+--------------------+ 
|[{"r":[0,0,0],"mo...| 
|{"r":[0,4,0,4],"m...| 
|{"r":5,"moid":"Ma...| 
|[{"r":[2147483647...| 
|{"r":[225,1112986...| 
|[{"r":[83250,0,0,...| 
|[{"r":[1,2,529982...| 
|[{"r":[26998564,0...| 
|[{"r":[0,0,0,0,0,...| 
|[{"r":[0,0,0],"mo...| 
|[{"r":[0,0,0],"mo...| 
|{"r":[0,0,0,0,0,0...| 
|{"r":[0,0,1],"moi...| 
|{"r":[4587,4587],...| 
|[{"r":[180,180],"...| 
|[{"r":["0,0,0,0,0...| 
|{"r":[0,35101,0,0...| 
|[{"r":["0,0,0,0,0...| 
|[{"r":[0,1558],"m...| 
|[{"r":["7484,4870...| 
+--------------------+ 

scala> df1.explode("mv","mvnew")(mv: String => mv.split(",")) 
<console>:1: error: ')' expected but '(' found. 
       df1.explode("mv","mvnew")(mv: String => mv.split(",")) 
                                                       ^ 
<console>:1: error: ';' expected but ')' found. 
       df1.explode("mv","mvnew")(mv: String => mv.split(",")) 

Am I doing something wrong? I need to extract the data under mi.mv into separate columns so that I can apply some transformations to it.

2 Answers:

Answer 0 (score: 2)

I know this is old, but I have a solution that is useful for anyone looking to solve this problem (as I was). I have been using Spark 1.5 built with Scala 2.10.4.

It appears to just be a formatting thing. I was reproducing all of the errors above, and what worked for me was

df1.explode("mv","mvnew"){mv: String => mv.asInstanceOf[String].split(",")} 

I don't fully understand why I need to define mv as a String twice, and I'd be interested if someone is willing to explain that, but this should let someone explode a DataFrame column.
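A side note on the original syntax error, as I read it (this is my interpretation, not something the compiler message spells out): when the function literal is passed in plain parentheses, mv: String => ... is parsed as a type ascription rather than as a typed parameter, which is why the call in the question fails to parse. Either the braces form above or an extra pair of parentheses around the parameter compiles:

df1.explode("mv","mvnew"){ mv: String => mv.asInstanceOf[String].split(",") }   // braces form, as used above
df1.explode("mv","mvnew")((mv: String) => mv.asInstanceOf[String].split(","))   // parameter explicitly parenthesized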

One more gotcha: if you split on a special character (say "?"), you need to escape it twice. So in the above, splitting on "?" would give:

df1.explode("mv","mvnew"){mv: String => mv.asInstanceOf[String].split("\\?")} 

I hope this helps someone.

Answer 1 (score: 1)

Remove the String typing of mv, like this:

df1.explode("mv","mvnew")(mv => mv.split(","))

since the typing is already given in the definition of explode.

UPDATE (see comments)

You then get another error, where df1 is of type Unit rather than DataFrame. You can solve that as follows:

val df1=df.select("mi.mv")
df1.show()
df1.explode...

This is because show() returns a value of type Unit, which is what you were previously trying to run explode on. The above ensures that you run explode on the actual DataFrame.
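Putting the two answers together, a minimal end-to-end sketch (untested as written here; it assumes the same file path, schema, and Spark 1.4/1.5 shell as in the question):

// Select mi.mv without chaining .show(), so df1 stays a DataFrame
// instead of the Unit value returned by show().
val df  = sqlContext.read.json("/Users/xx/target/statsfile.json")
val df1 = df.select("mi.mv")
df1.show()

// Explode each mv string into one row per comma-separated piece,
// using the typed-lambda form from the other answer.
val exploded = df1.explode("mv", "mvnew") { mv: String => mv.asInstanceOf[String].split(",") }
exploded.show()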