I'm having trouble splitting the contents of a DataFrame column using Spark 1.4. The DataFrame was created by reading a nested, complex JSON file. I used df.explode but keep getting an error message. The JSON file has the following format:
[
  {
    "neid":{ },
    "mi":{
      "mts":"20100609071500Z",
      "gp":"900",
      "tMOID":"Aal2Ap",
      "mt":[ ],
      "mv":[
        {
          "moid":"ManagedElement=1,TransportNetwork=1,Aal2Sp=1,Aal2Ap=r1552q",
          "r":[ 1, 2, 5 ]
        },
        {
          "moid":"ManagedElement=1,TransportNetwork=1,Aal2Sp=1,Aal2Ap=r1542q",
          "r":[ 1, 2, 5 ]
        }
      ]
    }
  },
  {
    "neid":{
      "neun":"RC003",
      "nedn":"SubNetwork=ONRM_RootMo_R,SubNetwork=RC003,MeContext=RC003",
      "nesw":"CP90831_R9YC/11"
    },
    "mi":{
      "mts":"20100609071500Z",
      "gp":"900",
      "tMOID":"PlugInUnit",
      "mt":"pmProcessorLoad",
      "mv":[
        {
          "moid":"ManagedElement=1,Equipment=1,Subrack=MS,Slot=6,PlugInUnit=1",
          "r":[ 1, 2, 5 ]
        },
        {
          "moid":"ManagedElement=1,Equipment=1,Subrack=ES-1,Slot=1,PlugInUnit=1",
          "r":[ 1, 2, 5 ]
        }
      ]
    }
  }
]
I load it in Spark 1.4 with the following code:
scala> val df = sqlContext.read.json("/Users/xx/target/statsfile.json")
scala> df.show()
+--------------------+--------------------+
| mi| neid|
+--------------------+--------------------+
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...|
|[900,["pmIcmpInEr...|[SubNetwork=ONRM_...|
|[900,pmUnsuccessf...|[SubNetwork=ONRM_...|
|[900,["pmBwErrBlo...|[SubNetwork=ONRM_...|
|[900,["pmSctpStat...|[SubNetwork=ONRM_...|
|[900,["pmLinkInSe...|[SubNetwork=ONRM_...|
|[900,["pmGrFc","p...|[SubNetwork=ONRM_...|
|[900,["pmReceived...|[SubNetwork=ONRM_...|
|[900,["pmIvIma","...|[SubNetwork=ONRM_...|
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...|
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...|
|[900,["pmExisOrig...|[SubNetwork=ONRM_...|
|[900,["pmHDelayVa...|[SubNetwork=ONRM_...|
|[900,["pmReceived...|[SubNetwork=ONRM_...|
|[900,["pmReceived...|[SubNetwork=ONRM_...|
|[900,["pmAverageR...|[SubNetwork=ONRM_...|
|[900,["pmDchFrame...|[SubNetwork=ONRM_...|
|[900,["pmReceived...|[SubNetwork=ONRM_...|
|[900,["pmNegative...|[SubNetwork=ONRM_...|
|[900,["pmUsedTbsQ...|[SubNetwork=ONRM_...|
+--------------------+--------------------+
scala> df.printSchema()
root
|-- mi: struct (nullable = true)
| |-- gp: long (nullable = true)
| |-- mt: string (nullable = true)
| |-- mts: string (nullable = true)
| |-- mv: string (nullable = true)
|-- neid: struct (nullable = true)
| |-- nedn: string (nullable = true)
| |-- nesw: string (nullable = true)
| |-- neun: string (nullable = true)
scala> val df1=df.select("mi.mv").show()
+--------------------+
| mv|
+--------------------+
|[{"r":[0,0,0],"mo...|
|{"r":[0,4,0,4],"m...|
|{"r":5,"moid":"Ma...|
|[{"r":[2147483647...|
|{"r":[225,1112986...|
|[{"r":[83250,0,0,...|
|[{"r":[1,2,529982...|
|[{"r":[26998564,0...|
|[{"r":[0,0,0,0,0,...|
|[{"r":[0,0,0],"mo...|
|[{"r":[0,0,0],"mo...|
|{"r":[0,0,0,0,0,0...|
|{"r":[0,0,1],"moi...|
|{"r":[4587,4587],...|
|[{"r":[180,180],"...|
|[{"r":["0,0,0,0,0...|
|{"r":[0,35101,0,0...|
|[{"r":["0,0,0,0,0...|
|[{"r":[0,1558],"m...|
|[{"r":["7484,4870...|
+--------------------+
scala> df1.explode("mv","mvnew")(mv: String => mv.split(","))
<console>:1: error: ')' expected but '(' found.
df1.explode("mv","mvnew")(mv: String => mv.split(","))
^
<console>:1: error: ';' expected but ')' found.
df1.explode("mv","mvnew")(mv: String => mv.split(","))
Am I doing something wrong? I need to extract the data under mi.mv into separate columns so that I can apply some transformations.
Answer 0 (score: 2)
I know this is old, but I have a solution that may be useful to anyone (like me) looking to fix this problem. I've been using Spark 1.5 built with Scala 2.10.4.
It seems to just be a formatting thing. I was reproducing all of the errors above, and what worked for me was:
df1.explode("mv","mvnew"){mv: String => mv.asInstanceOf[String].split(",")}
I don't fully understand why I needed to define mv as a String twice, and I'd be interested if someone could explain that, but this should let you explode a DataFrame column.
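A guess as to the "why": in Spark 1.x, DataFrame.explode is curried, roughly explode[A, B](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]), so the input type A cannot be inferred from the column name alone and has to be pinned down on the closure. A minimal sketch of the call (assuming df1 is the DataFrame returned by df.select("mi.mv"), not the Unit returned by show()):

// Braces make the closure a block expression, so the `mv: String` type
// ascription parses where the parenthesised form in the question did not.
val exploded = df1.explode("mv", "mvnew") { mv: String => mv.split(",") }
exploded.show()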
One more gotcha: if you split on a special character (say "?") you need to escape it twice. So in the above, splitting on "?" would give:
df1.explode("mv","mvnew"){mv: String => mv.asInstanceOf[String].split("\\?")}
I hope this helps someone.
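For context, String.split takes a regular expression, so regex metacharacters such as "?" have to be escaped for the regex engine, and the backslash itself has to be escaped again inside the Scala string literal, hence the double escape. A small illustration (java.util.regex.Pattern.quote is one way to avoid the manual escaping):

// "\\?" in source code is the two-character string \?, i.e. a regex-escaped '?'
"a?b?c".split("\\?")                                // Array(a, b, c)
"a?b?c".split(java.util.regex.Pattern.quote("?"))   // same result, no manual escaping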
Answer 1 (score: 1)
Remove the String type annotation on mv, like this:

df1.explode("mv","mvnew")(mv => mv.split(","))

since the input type is already defined in the explode definition.

You will then get a different error, where df1 has type Unit rather than DataFrame. You can fix that as follows:

val df1=df.select("mi.mv")
df1.show()
df1.explode...

This is because show() returns a value of type Unit, and that is what you were previously trying to run explode on. The above ensures you run explode on an actual DataFrame.
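Putting the two answers together, a minimal end-to-end sketch (untested here, assuming a Spark 1.4/1.5 shell with sqlContext available and the same file path as in the question):

val df = sqlContext.read.json("/Users/xx/target/statsfile.json")
val df1 = df.select("mi.mv")   // keep the DataFrame; don't assign the Unit returned by show()
df1.show()
// braces plus the explicit String type, as in answer 0
val exploded = df1.explode("mv", "mvnew") { mv: String => mv.split(",") }
exploded.show()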