Unable to execute a user-defined function on a Spark DataFrame in standalone Apache Spark using Scala

Asked: 2017-09-25 17:05:30

Tags: scala apache-spark xml-parsing spark-dataframe user-defined-functions

I have a UDF like this:

import org.apache.spark.sql.functions.{col, udf}

case class bodyresults(text: String, code: String)

val bodyudf = udf { (body: String) =>
  // Wrap the fragment in an explicit <body> element (declaring the nbsp entity) before parsing
  val xmlElems = xml.XML.loadString(s"""<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE body [<!ENTITY nbsp "&#160;">]><body>${body}</body>""")
  // Extract the contents of every <code> element
  val code = (xmlElems \\ "body" \\ "code").text
  // The text is everything in the body with the code string removed
  val text = (xmlElems \\ "body").text.replace(code, "")
  bodyresults(text, code)
}

I am trying to split the Body string into two strings, code and text:

CODE: anything inside a tag named code.

TEXT: everything else.
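
To make the intended split concrete, here is a minimal sketch of the same extraction run outside Spark on a small, well-formed input. It assumes the scala-xml module is on the classpath (Spark distributions bundle it). `splitBody`, `BodyResults`, and the sample string are illustrative names that are not part of the original code, and the DOCTYPE/`nbsp` declaration from the UDF is left out for brevity.

```scala
import scala.xml.XML

// Illustrative result type mirroring the case class used by the UDF
case class BodyResults(text: String, code: String)

// Hypothetical helper: the same logic as the UDF body, without Spark
def splitBody(body: String): BodyResults = {
  // Wrap the fragment in a single root element so it parses as one XML document
  val root = XML.loadString(s"<body>$body</body>")
  // Concatenated text of every <code> element
  val code = (root \\ "code").text
  // All text in the body, with the concatenated code string removed where it occurs
  val text = root.text.replace(code, "")
  BodyResults(text, code)
}

val sample = "<p>This is my code:</p><pre><code>this.Opacity = trans;</code></pre>"
val result = splitBody(sample)
```

One thing this sketch makes visible: when a post contains several `<code>` elements, `code` is their concatenation, so the `replace` call only strips it if that concatenated string appears verbatim in the full text.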

The Body column is of type String, and its content looks like this:

 <p>I want to use a track-bar to change a form's opacity.</p>
<p>This is my code:</p>
 <pre><code>decimal trans = trackBar1.Value / 5000;
 this.Opacity = trans;
</code></pre>
<p>When I build the application, it gives the following error:</p>
<blockquote>
  <p>Cannot implicitly convert type 'decimal' to 'double'.</p>
</blockquote>
<p>I tried using <code>trans</code> and <code>double</code> but then the 
control doesn't work. This code worked fine in a past VB.NET project. </p>
,While applying opacity to a form should we use a decimal or double value?

I am trying to use this UDF with the following commands:
val posts5=posts4.withColumn("codetext",bodyudf(col("Body")))
posts5.select("codetext").show()

This results in an error:

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => struct<text:string,code:string>)

Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 129; The element type "body" must be terminated by the matching end-tag "</body>"

But as you can see in the UDF, I add the body tag myself and close it.
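
One possible reading of the exception: as far as I can tell, the JDK SAX parser (which `xml.XML.loadString` delegates to) reports this message when it meets a closing tag that does not match the currently open element. A row whose text contains a stray closing tag such as `</p>` directly under the added `<body>` would therefore fail this way even though the wrapper itself is closed. The sketch below uses only JDK classes; `parseError` and the sample fragments are hypothetical and not taken from the data.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper: returns the parser's message if the wrapped
// fragment is not well-formed XML, or None if it parses cleanly.
def parseError(fragment: String): Option[String] = {
  val wrapped = s"<body>$fragment</body>"
  try {
    SAXParserFactory.newInstance().newSAXParser()
      .parse(new InputSource(new StringReader(wrapped)), new DefaultHandler)
    None
  } catch {
    case e: SAXParseException => Some(e.getMessage)
  }
}

// A complete paragraph parses without error
val ok = parseError("<p>complete paragraph</p>")
// A stray </p> closes nothing, so the parser complains about <body>
val stray = parseError("end of a paragraph</p>")
// An unclosed <p> is reported against <p>, not <body>
val unclosed = parseError("<p>start of a paragraph")
```

If this reading is right, the failing row is probably not a complete HTML fragment, for example because a multi-line Body value was split across rows when the data was loaded.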

Note: surprisingly, it works fine if I run the following command:
posts5.select("codetext").show(19)
+--------------------+
|            codetext|
+--------------------+
|[Given a  represe...|
|[Is there any sta...|
|[What is the diff...|
|[How do I store b...|
|[If I have a trig...|
|[How do you page ...|
|[Does anyone know...|
|[Does anybody kno...|
|[What are some gu...|
|[There are severa...|
|[I wrote a window...|
|[How do I format ...|
|[One may not alwa...|
|[Are PHP variable...|
|[What's the simpl...|
|[Does anyone know...|
|[I'm looking for ...|
|[What is the corr...|
|[I was wondering ...|
+--------------------+

But any number greater than 19 results in the error:

posts5.select("codetext").show(20)
or
posts5.select("codetext").show()

In case it helps, here is the Body string of row 20:

<p>I have a Queue&lt;T&gt; object that I have initialised to a capacity of 2, but obviously that is just the capacity and it keeps expanding as I add items.  Is there already an object that automatically dequeues an item when the limit is reached, or is the best solution to create my own inherited class?</p>,Limit size of Queue<T> in .NET?

I cannot figure out what causes this error, and I could not find anything relevant online. What is causing it?

Edit:

I removed row 20 because that string was missing a closing tag. But now the error occurs at row 19.

posts5.select("codetext").show(18) //18 or below works fine
posts5.select("codetext").show(19) // does not work

I passed the string from row 19 directly to the function and it works fine. So why does it fail when I pass the whole column to the UDF?
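
One way to narrow this down is to run the parsing step over the column values with the exception captured rather than thrown, so that every failing input is listed together with its error. The sketch below works on a plain `Seq[String]` standing in for the `Body` column (for instance, values collected from a small sample). `failingRows` and the sample values are hypothetical, and it assumes scala-xml is on the classpath, as it is with Spark.

```scala
import scala.util.{Failure, Try}
import scala.xml.XML

// Hypothetical helper: index and error message for every value that
// cannot be parsed once wrapped in <body>...</body>.
def failingRows(bodies: Seq[String]): Seq[(Int, String)] =
  bodies.zipWithIndex
    .map { case (body, i) => i -> Try(XML.loadString(s"<body>$body</body>")) }
    .collect { case (i, Failure(e)) => i -> e.getMessage }

// Made-up stand-ins for column values; the second one is not well-formed
val sampleBodies = Seq(
  "<p>well-formed row</p>",
  "tail of a row split in two</p>",
  "<p>another well-formed row</p>"
)
val failures = failingRows(sampleBodies)
```

Inside Spark, the same `Try` could wrap the UDF body and return the error message in an extra field, which would let a `filter` surface the offending rows. As far as I know, `show(n)` only evaluates about as many rows as it prints, which would explain why smaller values of `n` succeed while larger ones hit a bad row.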

0 answers