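I need to query large part files (about 12+ GB) in which each line is an XML record, and fetch certain unique IDs when a given condition matches. I am using Spark with Java. I created a flattened schema for the XML file. I do get results, but because I exploded every ArrayType column, the flattened data contains many duplicate rows, so the query takes longer.
As an example, consider XML records for books. I load each book into Spark as the rowTag.
Suppose I want to query for the IDs of books whose toi rating is 4. After flattening the schema, the data looks like this table: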
+-----+----------------+--------------------+---------------+-----+------------+-----------------------+---------------------+------------------+--------------------+
| _id| author| description| genres_genre|price|publish_date|ratings_rating_goodread|ratings_rating_kindle|ratings_rating_toi| title|
+-----+----------------+--------------------+---------------+-----+------------+-----------------------+---------------------+------------------+--------------------+
|bk106|Randall, Cynthia|When Carla meets ...| Romance| 4.95| 2000-09-02| 4.5| 4| 5| Lover Birds|
|bk106|Randall, Cynthia|When Carla meets ...| Romance| 4.95| 2000-09-02| 4.5| 4| 3| Lover Birds|
|bk107| Thurman, Paula|A deep sea diver ...| Romance| 4.95| 2000-11-02| 4.5| 4| 5| Splish Splash|
|bk107| Thurman, Paula|A deep sea diver ...| Romance| 4.95| 2000-11-02| 4.5| 4| 3| Splish Splash|
|bk108| Knorr, Stefan|An anthology of h...| Horror| 4.95| 2000-12-06| 4.5| 4| 5| Creepy Crawlies|
|bk108| Knorr, Stefan|An anthology of h...| Horror| 4.95| 2000-12-06| 4.5| 4| 3| Creepy Crawlies|
|bk109| Kress, Peter|After an inadvert...|Science Fiction| 6.95| 2000-11-02| 4.5| 4| 5| Paradox Lost|
|bk109| Kress, Peter|After an inadvert...|Science Fiction| 6.95| 2000-11-02| 4.5| 4| 3| Paradox Lost|
|bk110| O'Brien, Tim|Microsoft's .NET ...|Science Fiction|36.95| 2000-12-09| 4.5| 4| 5|Microsoft .NET: T...|
|bk110| O'Brien, Tim|Microsoft's .NET ...|Science Fiction|36.95| 2000-12-09| 4.5| 4| 3|Microsoft .NET: T...|
|bk110| O'Brien, Tim|Microsoft's .NET ...| Computer|36.95| 2000-12-09| 4.5| 4| 5|Microsoft .NET: T...|
|bk110| O'Brien, Tim|Microsoft's .NET ...| Computer|36.95| 2000-12-09| 4.5| 4| 3|Microsoft .NET: T...|
|bk111| O'Brien, Tim|The Microsoft MSX...|Science Fiction|36.95| 2000-12-01| 4.5| 4| 5|MSXML3: A Compreh...|
|bk111| O'Brien, Tim|The Microsoft MSX...|Science Fiction|36.95| 2000-12-01| 4.5| 4| 3|MSXML3: A Compreh...|
|bk111| O'Brien, Tim|The Microsoft MSX...| Computer|36.95| 2000-12-01| 4.5| 4| 5|MSXML3: A Compreh...|
|bk111| O'Brien, Tim|The Microsoft MSX...| Computer|36.95| 2000-12-01| 4.5| 4| 3|MSXML3: A Compreh...|
|bk112| Galos, Mike|Microsoft Visual ...|Science Fiction|49.95| 2001-04-16| 4.5| 4| 5|Visual Studio 7: ...|
|bk112| Galos, Mike|Microsoft Visual ...|Science Fiction|49.95| 2001-04-16| 4.5| 4| 3|Visual Studio 7: ...|
|bk112| Galos, Mike|Microsoft Visual ...| Computer|49.95| 2001-04-16| 4.5| 4| 5|Visual Studio 7: ...|
|bk112| Galos, Mike|Microsoft Visual ...| Computer|49.95| 2001-04-16| 4.5| 4| 3|Visual Studio 7: ...|
+-----+----------------+--------------------+---------------+-----+------------+-----------------------+---------------------+------------------+--------------------+
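To make the duplication concrete: exploding two independent array columns produces their Cartesian product, so a book with 2 genres and 2 ratings becomes 4 rows. The following is a minimal plain-Java sketch of that effect, not Spark code; the class and method names (`ExplodeRowCount`, `explodeAll`) are hypothetical and only illustrate the row counts seen in the table above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it mimics what exploding two array
// columns does to a single record. It is not part of Spark.
class ExplodeRowCount {

    // Returns one "id|genre|toi" row per (genre, toi) pair,
    // i.e. the Cartesian product of the two lists.
    static List<String> explodeAll(String id, List<String> genres, List<Integer> toiRatings) {
        List<String> rows = new ArrayList<>();
        for (String genre : genres) {
            for (int toi : toiRatings) {
                rows.add(id + "|" + genre + "|" + toi);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // bk106: 1 genre x 2 ratings
        System.out.println(explodeAll("bk106", List.of("Romance"), List.of(5, 3)).size());
        // bk110: 2 genres x 2 ratings
        System.out.println(explodeAll("bk110", List.of("Science Fiction", "Computer"), List.of(5, 3)).size());
    }
}
```

With more array columns, or longer arrays, the row count per record grows multiplicatively, which would explain why the larger files are so much slower.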
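This works fine for smaller files, but the more ArrayType data a file contains, the more duplicate rows are produced and the longer the query takes. For a 400 MB file with more than 1,100 XML records, it takes 60 seconds to get the result on a single machine. Is it possible to speed this up?
Can anyone tell me whether there is another way to solve this problem?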
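Since only the book id is needed, one direction I am considering is to skip flattening entirely and test the condition against the nested rating values of each record. Below is a minimal sketch of that idea in plain Java using the JDK's StAX parser rather than Spark. The class and method names (`ToiRatingMatcher`, `bookIdsWithToi`) are hypothetical, and it assumes the ratings can be compared as trimmed text; I have not measured it against the 12 GB input.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams through the XML once and collects the ids of
// books that have at least one <toi> value equal to the target.
class ToiRatingMatcher {

    static Set<String> bookIdsWithToi(String xml, String target) {
        Set<String> ids = new LinkedHashSet<>(); // a set, so each id appears once
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String currentId = null;
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("book")) {
                    // Remember which book the following ratings belong to.
                    currentId = reader.getAttributeValue(null, "id");
                } else if (name.equals("toi") && currentId != null
                        && target.equals(reader.getElementText().trim())) {
                    ids.add(currentId);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML input", e);
        }
        return ids;
    }
}
```

Calling `ToiRatingMatcher.bookIdsWithToi(xml, "4")` on the sample file returns an empty set, because every book there has toi ratings of 5 and 3 only; no duplicate rows are created along the way.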
<?xml version="1.0"?>
<catalog>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genres>
<genre>Romance</genre>
</genres>
<ratings>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>5</toi>
</rating>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>3</toi>
</rating>
</ratings>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genres>
<genre>Romance</genre>
</genres>
<ratings>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>5</toi>
</rating>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>3</toi>
</rating>
</ratings>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genres>
<genre>Horror</genre>
</genres>
<ratings>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>5</toi>
</rating>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>3</toi>
</rating>
</ratings>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genres>
<genre>Science Fiction</genre>
</genres>
<ratings>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>5</toi>
</rating>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>3</toi>
</rating>
</ratings>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genres>
<genre>Science Fiction</genre>
<genre>Computer</genre>
</genres>
<ratings>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>5</toi>
</rating>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>3</toi>
</rating>
</ratings>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genres>
<genre>Science Fiction</genre>
<genre>Computer</genre>
</genres>
<ratings>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>5</toi>
</rating>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>3</toi>
</rating>
</ratings>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genres>
<genre>Science Fiction</genre>
<genre>Computer</genre>
</genres>
<ratings>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>5</toi>
</rating>
<rating>
<goodread>4.5</goodread>
<kindle>4</kindle>
<toi>3</toi>
</rating>
</ratings>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
Books.xml