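I have a Spark job that reads JSON files from S3. One of the JSON fields is an XML string. I extract values from it, flatten them, and write the result back to S3. Here is the code: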
t = sqc.read.json(path)
t.registerTempTable("t")
t1 = sqc.sql("""
select t.id,
t.post_datetime,
t.trans_xml
from t
""")
t1.registerTempTable("t1")
info = sqc.sql("""
select t1.id
,t1.post_datetime
,check_info_name
,check_info_value
from t1
lateral view inline(array(struct(xpath_string(trans_xml,'transaction/check-info/info[@name="store"]/@name')
,xpath_string(trans_xml,'transaction/check-info/info[@name="store"]/@value'))
,struct(xpath_string(trans_xml,'transaction/check-info/info[@name="emp_name"]/@name')
,xpath_string(trans_xml,'transaction/check-info/info[@name="emp_name"]/@value'))
,struct(xpath_string(trans_xml,'transaction/check-info/info[@name="pos_datetime"]/@name')
,xpath_string(trans_xml,'transaction/check-info/info[@name="pos_datetime"]/@value'))
,struct(xpath_string(trans_xml,'transaction/check-info/info[@name="order_number"]/@name')
,xpath_string(trans_xml,'transaction/check-info/info[@name="order_number"]/@value'))
,struct(xpath_string(trans_xml,'transaction/check-info/info[@name="unique_check"]/@name')
,xpath_string(trans_xml,'transaction/check-info/info[@name="unique_check"]/@value'))
)) x as check_info_name,check_info_value
""")
info.registerTempTable("info_vw")
info.write.mode("overwrite").save(temp_path, "parquet")
The job works fine in local mode, but when I run it with --master yarn, it fails with this error:
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(1,0,ResultTask,ExceptionFailure(org.apache.spark.SparkException,Task failed while writing rows,[Ljava.lang.StackTraceElement;@1d57caf9,org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Invalid XML document: org.xml.sax.SAXException: FWK005 parse may not be called while parsing.
<?xml version='1.0' encoding='ISO-8859-1'?>
<transaction capture_name="2017-01-10_130832.cap.pickle.gz" processing_time="0.267999887466"><raw_receipt>H4sIACEjdVgC/6VT0W7aMBQtr5byBdGks1WbWgkHOwkQeBuMqTygMoUfMOQWIkI8xaZp9vVzijqB
hNqHXSlOIt977vG5x4y14Yf+rOOrjs87+DD8zze+uvk0ufHY+afqpMd1rZpbOUoGMRdY7nRJiOWI
x0PJhYgij/WFkFjostKE1FZEFj+nuIWUHlvpgjLdxeNDF3E0CCMuk3DosZSqZ8qwbsZYUJEboyB7
UvRCIYeQY5GMoxC/Dw6AqgPmP/iqUqVxmL3vXMaDJO577IzqL9sgzf8Q5pYO10+4rPINeYxzDv5v
uRrtDvcYIN3P4AsmaqNLzLZbfMN0R2QIk/2TcadEFAxfiTiVsNJWFe9IfMpNVUEGK/WCu2EQ9r/e
X8kUQSvRCe9upizm5f11zDgQocemyuw+GnA/EMJl7lS5pfczRTBKzqR1Jlq5qj0afcSTrtp3hfXR
5CUZ08WyIOUU2egDQW1VXl446EE7nR6rjCo3xDHS5aKdsJDSeUYkUXgxxbdW5rXPc25ym5dbnCwY
vHUqyHrsaLAvdY2de2pClmewOlONcxSs2rdljqbHJA55ebQEc3SGa+DErOs6sFQU5gTriF9S+J8r
8xcUpuikegMAAA==
</raw_receipt><device id="69658" pos="" sw_version="v4.3.43" unique_id="9C:8E:99:EC:7E:2F" /><check-info check_mode="" check_number="" check_type="normal" post_type="active" processing_time="0" ric_datetime="2017-01-10 13:08:32 -0500" splits="1" store_id="57833"><info name="store" value="Store 19864" /><info name="emp_name" value="Melissa" /><info name="pos_datetime" value="1/10/2017 1:08:32 pm" /><info name="order_number" value="146845" /><info name="unique_check" value="1/10/2017 1:08:32 pmX" /></check-info><receipts><receipt account_id="" input_id="" method="" program_id="NULL" receipt_counter="10819" serial_id="" status=""><items><item guest="0" id="" name="6" Bacon Egg & Cheese Bkfst F" price="3.75" quantity="1" serial="" sku="" trigger_id="" type_id=""><nutrition /></item></items><messages><message id="-1" serial="-1" type="" /><message id="7483" serial="7483" type="6" /><message id="0" serial="0" type="74" /><message id="33879" serial="33879" type="456" /><message id="30680" serial="30680" type="466" /><message id="5164" serial="5164" type="7" /><message id="0" serial="0" type="" /></messages></receipt></receipts><day_parts><day_part id="627" /><day_part id="625" /><day_part id="499" /><day_part id="485" /><day_part id="493" /><day_part id="473" /><day_part id="407" /><day_part id="406" /><day_part id="507" /><day_part id="502" /><day_part id="503" /><day_part id="408" /><day_part id="632" /><day_part id="439" /><day_part id="475" /><day_part id="501" /><day_part id="169" /><day_part id="93" /><day_part id="483" /><day_part id="481" /><day_part id="512" /><day_part id="432" /><day_part id="430" /><day_part id="531" /><day_part id="530" /><day_part id="110" /></day_parts><totals><total ext="" name="Sub Total" string_type_id="101" value="3.75" /><total ext="" name="Sales Tax (7.25%)" string_type_id="" value="0.27" /><total ext="" name="Total (Eat In)" string_type_id="100" value="4.02" /><total ext="" name="Cash" string_type_id="" value="5.00" /><total ext="" 
name="Change" string_type_id="" value="0.98" /></totals></transaction>
at org.apache.spark.sql.catalyst.expressions.xml.UDFXPathUtil.eval(UDFXPathUtil.java:72)
at org.apache.spark.sql.catalyst.expressions.xml.UDFXPathUtil.evalString(UDFXPathUtil.java:81)
at org.apache.spark.sql.catalyst.expressions.xml.XPathString.nullSafeEval(xpath.scala:146)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypeCreator.scala:198)
at org.apache.spark.sql.catalyst.expressions.CreateArray$$anonfun$eval$1.apply(complexTypeCreator.scala:48)
at org.apache.spark.sql.catalyst.expressions.CreateArray$$anonfun$eval$1.apply(complexTypeCreator.scala:48)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.expressions.CreateArray.eval(complexTypeCreator.scala:48)
at org.apache.spark.sql.catalyst.expressions.Inline.eval(generators.scala:277)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$apply$1.apply(GenerateExec.scala:75)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$apply$1.apply(GenerateExec.scala:72)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
Caused by: javax.xml.xpath.XPathExpressionException: org.xml.sax.SAXException: FWK005 parse may not be called while parsing.
at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:305)
at org.apache.spark.sql.catalyst.expressions.xml.UDFXPathUtil.eval(UDFXPathUtil.java:70)
... 44 more
Caused by: org.xml.sax.SAXException: FWK005 parse may not be called while parsing.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:302)
... 45 more
I couldn't find anything about this error online. Any ideas?
Thanks, Ram.
Answer 0 (score: 0)
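The patch attached to this issue fixes the problem: https://issues.apache.org/jira/browse/SPARK-24542
The cause is a thread-safety problem in calling XPathExpression#evaluate(org.xml.sax.InputSource, QName). As the stack trace shows, that overload parses the XML internally (XPathExpressionImpl.evaluate calls DocumentBuilderImpl.parse), and FWK005 is raised when parse is called while another parse is still in progress.
XPathExpression#evaluate(Document, QName) should be used instead, so that the XML is parsed into a Document first and the expression is evaluated against that Document.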
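If upgrading to a patched Spark build is not an option, one possible workaround is to avoid xpath_string altogether and parse the XML in Python. The following is only a minimal sketch under that assumption, not the fix from the JIRA issue: the helper name extract_check_info, the INFO_NAMES list, and the trimmed sample document are hypothetical and introduced here for illustration. It uses only the standard library, so it can be tried without a cluster.

```python
import xml.etree.ElementTree as ET

# Names of the <info> elements that the query in the question extracts.
INFO_NAMES = ["store", "emp_name", "pos_datetime", "order_number", "unique_check"]


def extract_check_info(trans_xml):
    """Return one (name, value) pair per entry in INFO_NAMES.

    A missing <info> element yields ("", ""), similar to xpath_string
    returning an empty string. Unparseable XML yields an empty list.
    """
    try:
        root = ET.fromstring(trans_xml)  # root element is <transaction>
    except ET.ParseError:
        return []
    pairs = []
    for name in INFO_NAMES:
        node = root.find("check-info/info[@name='%s']" % name)
        if node is None:
            pairs.append(("", ""))
        else:
            pairs.append((node.get("name", ""), node.get("value", "")))
    return pairs


# Trimmed, hypothetical sample based on the XML in the error message.
sample = (
    "<transaction><check-info>"
    '<info name="store" value="Store 19864" />'
    '<info name="emp_name" value="Melissa" />'
    "</check-info></transaction>"
)
print(extract_check_info(sample)[:2])
# prints [('store', 'Store 19864'), ('emp_name', 'Melissa')]
```

In a job like the one above, a function of this kind could presumably be wrapped with pyspark.sql.functions.udf, given an array-of-struct return type, and exploded in place of the lateral view. That wiring is not shown or tested here, and a Python UDF is likely to be slower than the built-in function.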