Processing an XML string in a Spark UDF and returning a struct field

Time: 2017-09-20 04:13:57

Tags: scala apache-spark xml-parsing spark-dataframe user-defined-functions

I have a DataFrame column named Body (String). The data in the Body column looks like this:

create!

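From Body I want to produce two separate columns, code and text. The code is whatever sits inside the elements named code, and the text is everything else.

I created a UDF that looks like this: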

CSV.parse

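This does not work. My questions are:

1. As you can see, the data has no root node. Can I still use Scala XML parsing?
2. How do I parse everything other than the code as text?

Please let me know if there is something wrong with my code.

Expected output: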

<p>I want to use a track-bar to change a form's opacity.</p>

<p>This is my code:</p>

 <pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>

<p>When I build the application, it gives the following error:</p>

<blockquote>
  <p>Cannot implicitly convert type 'decimal' to 'double'.</p>
</blockquote>

<p>I tried using <code>trans</code> and <code>double</code> but then the 
control doesn't work. This code worked fine in a past VB.NET project. </p>
,While applying opacity to a form should we use a decimal or double value?

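1 answer:

Answer 0 (score: 1)

Instead of doing a string replacement, you could also use a RewriteRule and override its transform method to empty the <pre> tag in the XML. The version below uses the replacement approach.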

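import org.apache.spark.sql.functions.udf
import scala.xml.XML

// Result type; Spark maps it to struct<text:string,code:string>
case class bodyresults(text: String, code: String)

val bodyudf = udf { (body: String) =>
    // The data has no root node, so wrap it in an explicit <body> element before parsing
    val xmlElems = XML.loadString(s""" <body> ${body} </body> """)
    // Extract the code inside <pre><code>
    val code = (xmlElems \\ "body" \\ "pre" \\ "code").text
    // Everything else is the text; replace() treats code as a literal string,
    // whereas replaceAll() would interpret it as a regular expression
    val text = (xmlElems \\ "body").text.replace(code, "")

    bodyresults(text, code)
}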

This UDF returns a StructType, like:

org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,StructType(StructField(text,StringType,true), StructField(code,StringType,true)),List(StringType))

You can now call it on your DataFrame to build posts5:

val posts5 = df.withColumn("codetext", bodyudf($"xml") )
posts5: org.apache.spark.sql.DataFrame = [xml: string, codetext: struct<text:string,code:string>]

To extract a specific field:

posts5.select($"codetext.code" ).show
+--------------------+
|                code|
+--------------------+
|decimal trans = t...|
+--------------------+
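
The RewriteRule alternative mentioned at the start of this answer could look roughly like the sketch below. It is only an illustration under a few assumptions: the names DropPre and splitBody are hypothetical, the body is assumed to be well-formed XML once wrapped in a root element, and the scala-xml module (bundled with Spark) is assumed to be on the classpath.

```scala
import scala.xml.{Elem, Node, NodeSeq, XML}
import scala.xml.transform.{RewriteRule, RuleTransformer}

// Hypothetical rule: replaces every <pre> element (and the <code> inside it) with nothing.
object DropPre extends RewriteRule {
  override def transform(n: Node): Seq[Node] = n match {
    case e: Elem if e.label == "pre" => NodeSeq.Empty
    case other                       => other
  }
}

// Hypothetical helper: returns (text, code) for one Body value.
def splitBody(body: String): (String, String) = {
  // Wrap the fragment in an explicit root element, as in the UDF above
  val root = XML.loadString(s"<body>$body</body>")
  // Code is whatever sits inside <pre><code>
  val code = (root \\ "pre" \\ "code").text
  // Text is what remains after the <pre> elements are removed from the tree
  val text = new RuleTransformer(DropPre).apply(root).text
  (text, code)
}
```

Because the <pre> elements are removed from the tree itself, the text does not depend on the code string matching exactly, and inline <code> elements outside <pre> stay in the text. The helper could presumably be wrapped with udf in the same way as bodyudf, returning bodyresults(text, code).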