Need to query XML files with ArrayType and StructType fields in Spark

Time: 2019-04-09 06:02:16

Tags: java apache-spark apache-spark-xml

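I need to query large part files (roughly 12+ GB) in which every line is an XML record, and fetch certain unique IDs when a given condition matches. I am using Spark with Java. I created a flattened schema for the XML files. I do get the results, but because I explode every ArrayType column, the flattened schema contains many duplicate rows, so the query takes longer.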

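For example, consider XML records for books. I loaded the data into Spark with each book as the rowTag. After flattening the schema, the data looks like the table below.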

Now, suppose I want to get the IDs of the books whose toi rating is 4.

+-----+----------------+--------------------+---------------+-----+------------+-----------------------+---------------------+------------------+--------------------+
|  _id|          author|         description|   genres_genre|price|publish_date|ratings_rating_goodread|ratings_rating_kindle|ratings_rating_toi|               title|
+-----+----------------+--------------------+---------------+-----+------------+-----------------------+---------------------+------------------+--------------------+
|bk106|Randall, Cynthia|When Carla meets ...|        Romance| 4.95|  2000-09-02|                    4.5|                    4|                 5|         Lover Birds|
|bk106|Randall, Cynthia|When Carla meets ...|        Romance| 4.95|  2000-09-02|                    4.5|                    4|                 3|         Lover Birds|
|bk107|  Thurman, Paula|A deep sea diver ...|        Romance| 4.95|  2000-11-02|                    4.5|                    4|                 5|       Splish Splash|
|bk107|  Thurman, Paula|A deep sea diver ...|        Romance| 4.95|  2000-11-02|                    4.5|                    4|                 3|       Splish Splash|
|bk108|   Knorr, Stefan|An anthology of h...|         Horror| 4.95|  2000-12-06|                    4.5|                    4|                 5|     Creepy Crawlies|
|bk108|   Knorr, Stefan|An anthology of h...|         Horror| 4.95|  2000-12-06|                    4.5|                    4|                 3|     Creepy Crawlies|
|bk109|    Kress, Peter|After an inadvert...|Science Fiction| 6.95|  2000-11-02|                    4.5|                    4|                 5|        Paradox Lost|
|bk109|    Kress, Peter|After an inadvert...|Science Fiction| 6.95|  2000-11-02|                    4.5|                    4|                 3|        Paradox Lost|
|bk110|    O'Brien, Tim|Microsoft's .NET ...|Science Fiction|36.95|  2000-12-09|                    4.5|                    4|                 5|Microsoft .NET: T...|
|bk110|    O'Brien, Tim|Microsoft's .NET ...|Science Fiction|36.95|  2000-12-09|                    4.5|                    4|                 3|Microsoft .NET: T...|
|bk110|    O'Brien, Tim|Microsoft's .NET ...|       Computer|36.95|  2000-12-09|                    4.5|                    4|                 5|Microsoft .NET: T...|
|bk110|    O'Brien, Tim|Microsoft's .NET ...|       Computer|36.95|  2000-12-09|                    4.5|                    4|                 3|Microsoft .NET: T...|
|bk111|    O'Brien, Tim|The Microsoft MSX...|Science Fiction|36.95|  2000-12-01|                    4.5|                    4|                 5|MSXML3: A Compreh...|
|bk111|    O'Brien, Tim|The Microsoft MSX...|Science Fiction|36.95|  2000-12-01|                    4.5|                    4|                 3|MSXML3: A Compreh...|
|bk111|    O'Brien, Tim|The Microsoft MSX...|       Computer|36.95|  2000-12-01|                    4.5|                    4|                 5|MSXML3: A Compreh...|
|bk111|    O'Brien, Tim|The Microsoft MSX...|       Computer|36.95|  2000-12-01|                    4.5|                    4|                 3|MSXML3: A Compreh...|
|bk112|     Galos, Mike|Microsoft Visual ...|Science Fiction|49.95|  2001-04-16|                    4.5|                    4|                 5|Visual Studio 7: ...|
|bk112|     Galos, Mike|Microsoft Visual ...|Science Fiction|49.95|  2001-04-16|                    4.5|                    4|                 3|Visual Studio 7: ...|
|bk112|     Galos, Mike|Microsoft Visual ...|       Computer|49.95|  2001-04-16|                    4.5|                    4|                 5|Visual Studio 7: ...|
|bk112|     Galos, Mike|Microsoft Visual ...|       Computer|49.95|  2001-04-16|                    4.5|                    4|                 3|Visual Studio 7: ...|
+-----+----------------+--------------------+---------------+-----+------------+-----------------------+---------------------+------------------+--------------------+

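This works fine for smaller files, but the more ArrayType data there is, the more duplicate rows are produced and the longer it takes to get results. For a 400 MB file with more than 1,100 XML records, it takes 60 seconds to get results on a single machine. Is there a way to speed this up?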
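
To make the row blow-up concrete, here is a minimal plain-Java sketch (not Spark code) that models two of the sample books in memory. The class and method names (`NestedFilterSketch`, `explodedRowCount`, `idsWithToi`) are hypothetical and exist only for this illustration. It counts the rows that exploding every array would produce, then finds matching book IDs by scanning the nested ratings directly. The sample data only contains toi values of 5 and 3, so the sketch queries for 5. In Spark, the analogous idea would be to keep the arrays nested and filter on them rather than exploding first; that mapping is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of the <book> records; this is not Spark code.
class NestedFilterSketch {

    // Stand-in for one <rating> element.
    static final class Rating {
        final double goodread;
        final int kindle;
        final int toi;

        Rating(double goodread, int kindle, int toi) {
            this.goodread = goodread;
            this.kindle = kindle;
            this.toi = toi;
        }
    }

    // Stand-in for one <book> element with its two array-valued children.
    static final class Book {
        final String id;
        final List<String> genres;
        final List<Rating> ratings;

        Book(String id, List<String> genres, List<Rating> ratings) {
            this.id = id;
            this.genres = genres;
            this.ratings = ratings;
        }
    }

    // Rows produced when every array is exploded:
    // one row per (genre, rating) pair, per book.
    static long explodedRowCount(List<Book> books) {
        long rows = 0;
        for (Book b : books) {
            rows += (long) b.genres.size() * b.ratings.size();
        }
        return rows;
    }

    // IDs of books with at least one rating whose toi equals the target.
    // The nested list is scanned in place, so each ID appears at most once.
    static List<String> idsWithToi(List<Book> books, int toi) {
        List<String> ids = new ArrayList<>();
        for (Book b : books) {
            for (Rating r : b.ratings) {
                if (r.toi == toi) {
                    ids.add(b.id);
                    break; // one match is enough for this book
                }
            }
        }
        return ids;
    }

    // Two books copied from the sample XML: bk106 and bk110.
    static List<Book> sampleBooks() {
        List<Rating> ratings = Arrays.asList(
                new Rating(4.5, 4, 5),
                new Rating(4.5, 4, 3));
        return Arrays.asList(
                new Book("bk106", Arrays.asList("Romance"), ratings),
                new Book("bk110", Arrays.asList("Science Fiction", "Computer"), ratings));
    }

    public static void main(String[] args) {
        List<Book> books = sampleBooks();
        System.out.println(explodedRowCount(books)); // prints 6
        System.out.println(idsWithToi(books, 5));    // prints [bk106, bk110]
    }
}
```

The two books become six flattened rows (2 for bk106, 4 for bk110), which matches the table above, while the nested scan visits each book once and needs no de-duplication afterwards.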

Can someone let me know whether there is another way to approach this problem?

<?xml version="1.0"?>
<catalog>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genres>
        <genre>Romance</genre>
      </genres>
      <ratings>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>5</toi>
        </rating>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>3</toi>
        </rating>
      </ratings>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genres>
        <genre>Romance</genre>
      </genres>
      <ratings>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>5</toi>
        </rating>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>3</toi>
        </rating>
      </ratings>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genres>
        <genre>Horror</genre>
      </genres>
      <ratings>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>5</toi>
        </rating>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>3</toi>
        </rating>
      </ratings>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genres>
        <genre>Science Fiction</genre>
      </genres>
      <ratings>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>5</toi>
        </rating>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>3</toi>
        </rating>
      </ratings>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genres>
        <genre>Science Fiction</genre>
        <genre>Computer</genre>
      </genres>
      <ratings>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>5</toi>
        </rating>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>3</toi>
        </rating>
      </ratings>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genres>
        <genre>Science Fiction</genre>
        <genre>Computer</genre>
      </genres>
      <ratings>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>5</toi>
        </rating>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>3</toi>
        </rating>
      </ratings>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genres>
        <genre>Science Fiction</genre>
        <genre>Computer</genre>
      </genres>
      <ratings>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>5</toi>
        </rating>
        <rating>
            <goodread>4.5</goodread>
            <kindle>4</kindle>
            <toi>3</toi>
        </rating>
      </ratings>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.</description>
   </book>
</catalog>

Books.xml

0 answers:

No answers