Question

I am new to Spark and to programming in general. I need some help parsing an XML file tag by tag.

Here is my small example input file:

XML File:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
  <book style="autobiography">
    <author>
      <first-name>Joe</first-name>
      <last-name>Bob</last-name>
      <award>Trenton Literary Review Honorable Mention</award>
    </author>
    <price>12</price>
  </book>
</bookstore>

XPATH for above file:

/bookstore[@specialty="novel"]/book[@style="autobiography"]/price
/bookstore[@specialty="novel"]/book[@style="autobiography"]/author
/bookstore[@specialty="novel"]/book[@style="autobiography"]
/bookstore[@specialty="novel"]

Now I want to read the XPaths and split the file into one output file per tag (bookstore.txt, book.txt, author.txt):

Bookstore.txt:

UUID= 1233455 (has to be generated on the fly)
specialty="novel"

Book.txt:

UUID= 1233455 (coming from bookstore)
style="autobiography"
<price>12</price>

Author.txt:

UUID= 9876534 (generated on the fly and linked to the book file)
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>

Could someone please help me solve this?

Thanks in advance.

Answer 1

Use Spark SQL with the spark-xml module:

A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames.
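To make the requested per-tag split concrete, here is a minimal sketch in plain Python using only the standard library (xml.etree.ElementTree and uuid), not spark-xml itself. The function name split_by_tag and the key names bookstore_uuid and book_uuid are hypothetical. The linking scheme (every record gets its own UUID plus the UUID of its parent) is an assumption about what the question intends. With spark-xml the read step is typically written along the lines of spark.read.format("com.databricks.spark.xml").option("rowTag", "book").load(path), after which the same split could be expressed with DataFrame selects.

```python
import uuid
import xml.etree.ElementTree as ET

# Sample input from the question (stylesheet instruction omitted for brevity).
XML = """<?xml version="1.0"?>
<bookstore specialty="novel">
  <book style="autobiography">
    <author>
      <first-name>Joe</first-name>
      <last-name>Bob</last-name>
      <award>Trenton Literary Review Honorable Mention</award>
    </author>
    <price>12</price>
  </book>
</bookstore>
"""


def split_by_tag(xml_text):
    """Return (bookstores, books, authors) as lists of flat dicts.

    Assumption: each record carries its own generated UUID and the UUID
    of its parent record, so the three outputs can be joined again later.
    """
    root = ET.fromstring(xml_text)  # root element is <bookstore>
    bookstores, books, authors = [], [], []

    store_id = str(uuid.uuid4())
    bookstores.append({"uuid": store_id, "specialty": root.get("specialty")})

    for book in root.findall("book"):
        book_id = str(uuid.uuid4())
        books.append({
            "uuid": book_id,
            "bookstore_uuid": store_id,  # link back to the bookstore record
            "style": book.get("style"),
            "price": book.findtext("price"),
        })
        for author in book.findall("author"):
            authors.append({
                "uuid": str(uuid.uuid4()),
                "book_uuid": book_id,  # link back to the book record
                "first-name": author.findtext("first-name"),
                "last-name": author.findtext("last-name"),
                "award": author.findtext("award"),
            })

    return bookstores, books, authors


if __name__ == "__main__":
    stores, book_rows, author_rows = split_by_tag(XML)
    for name, rows in (("bookstore", stores), ("book", book_rows), ("author", author_rows)):
        print(name, rows)
```

Each list can then be written to its own file (bookstore.txt, book.txt, author.txt). The UUIDs are random, so they differ on every run.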

How to process XML datasets?
