How to validate one large DataFrame by joining in Spark

Time: 2018-05-02 11:55:42

Tags: scala apache-spark apache-spark-sql spark-dataframe

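I have a large table (DataFrame) with more than 50 million records and 135 columns. For each row, I need to validate more than 50 columns.

So basically, for each row I need to look up the corresponding values from all 25 reference tables.

I have listed only 4 small tables here, but in my case there will be 25 such tables.

For example, here is one of my validations, called CityId Validation.

To perform the CityId validation, we need the TownCode from Table 2, obtained by passing physicalstateorprovincecode, physicalcountrycode and physicalcityname from Table 1.

Using the TownCode, I then have to go to Table 3, passing physicalcountrycode, physicalstateorprovincecode and TownCode, to get the CityID.

If a CityID is available, the value is correct; otherwise it is wrong.

Below is what my DataFrame looks like.

The logic above is an example for just one column, but I have to perform this kind of validation for more than 50 columns.

Can we do this in Spark?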

Table 1: main table (50 million records)

+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|filler1|dunsnumber|businessname                               |tradestylename              |registeredaddressindicator|physicalstreetaddress    |physicalstreetaddress2|physicalcityname|physicalstateorprovincename|physicalcountryname|physicalcitycode|physicalcountycode|physicalstateorprovincecode|physicalstateorprovinceabbreviation|physicalcountrycode|physicalpostalcode|physicalcontinentcode|mailingaddress|mailingcityname|mailingcountyname|mailingstateorprovincename|mailingcountryname|mailingcitycode|mailingcountycode|mailingstateorprovincecode|mailingstateorprovinceabbreviation|mailingcountrycode|mailingpostalcode|mailingcontinentcode|nationalidentificationnumber|nationalidentificationsystemcode|countrytelephoneaccesscode|telephonenumber|cabletelex|faxnumber |chiefexecutiveofficername|chiefexecutiveofficertitle|lineofbusiness                           |sic1|sic2|sic3|sic4|sic5|sic6|primarylocalactivitycode|activityindicator|yearstarted|annualsaleslocal  |annualsalesindicator|annualsalesinusd|currencycode|employeeshere|employeeshereindicator|employeestotal|employeestotalindicator|includeprinciplesindicator|importexportagentindicator|legalstatus|filler2|statuscode|subsidiarycode|filler3|previousdunsnumber|financialstatementdate|filler4|headquarterorparentdunsnumber|headquarterorparentbusinessname            |headquarterorparentstreetaddress|headquarterorparentcityname|headquarterorparentstateorprovincename|headquarterorparentcountryname|headquarterorparentcitycode|headquarterorparentcountycode|headquarterorparentstateorprovinceabbreviation|headquarterorparentcountrycode|headquarterorparentpostalcode|headquarterorparentcontinentcode|filler5|domesticultimatedunsnumbers|domesticultimatebusinessname          
|domesticultimatephysicalstreetaddress|domesticultimatecityname|domesticultimatestateorprovincename|domesticultimatecitycode|domesticultimatecountrycode|domesticultimatestateorprovinceabbreviation|domesticultimatepostalcode|globalultimateindicator|filler6|globalultimatedunsnumber|globalultimatebusinessname            |globalultimatestreetaddress           |globalultimatecityname|globalultimatestateorprovincename|globalultimatecountryname|globalultimatecitycode|globalultimatecountycode|globalultimatestateorprovinceabbreviation|globalultimatecountrycode|globalultimatepostalcode|globalultimatecontinentcode|numberoffamilymembers|diascode |hierarchycode|filler7|filler8|urldomain               |naics1|naics2|naics3|naics4|naics5|naics6|publicprivateindicator|obindicator|latitude  |longitude  |oporactdescpart1                                                                                                                                                                                                                         |oporactdescpart2|oporactdescpart3|oporactdescpart4|oporactdescpart5|nixieindicator|delistindicator|primary8digitsic|primary8digitdescription                                    |primarynaicsdescription                                        |natlidfull|transactionalindicator|
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|       |001007108 |DOLGENCORP, LLC                            |DOLLAR GENERAL              |N                         |1342 PINE ST             |                      |UNADILLA        |GEORGIA                    |USA                |008857          |296               |019                        |GA                                 |805                |31091             |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |4786279585     |          |          |EVE MEADOWS              |MANAGER                   |VARIETY STORES                           |5331|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000006      |1                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |068331990                    |DOLGENCORP, LLC                            |100 MISSION RDG                 |GOODLETTSVILLE             |TENNESSEE                             |USA                           |003754                     |203                          |TN                                            |805                           |370722171                    |6                               |       |006946172                  |DOLLAR GENERAL CORPORATION            |100 MISSION RDG                      |GOODLETTSVILLE          |TENNESSEE                          |003754                  |805                        |TN                                         |370722171                 |N                      |       |006946172  
             |DOLLAR GENERAL CORPORATION            |100 MISSION RDG                       |GOODLETTSVILLE        |TENNESSEE                        |USA                      |003754                |203                     |TN                                       |805                      |370722171               |6                          |11210                |005479269|02           |       |       |                        |452319|      |      |      |      |      |                      |N          |+32.252708|-083.740074|                                                                                                                                                                                                                                         |                |                |                |                |N             |N              |53310000        |VARIETY STORES                                              |ALL OTHER GENERAL MERCHANDISE STORES                           |          |C                     |
|       |001132690 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|ADVANCE AMERICA             |N                         |332 N L ROGERS WELLS BLVD|                      |GLASGOW         |KENTUCKY                   |USA                |003211          |060               |033                        |KY                                 |805                |421411300         |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |2706511990     |          |          |LISA BROWN               |MANAGER                   |PERSONAL CREDIT INSTITUTIONS             |6141|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000002      |0                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |179469978                    |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|135 N CHURCH ST                 |SPARTANBURG                |SOUTH CAROLINA                        |USA                           |008468                     |839                          |SC                                            |805                           |293065138                    |6                               |       |078454395                  |EAGLE U.S. SUB, INC.                  
|135 N CHURCH ST                      |SPARTANBURG             |SOUTH CAROLINA                     |008468                  |805                        |SC                                         |293065138                 |N                      |       |811589639               |GRUPO ELEKTRA, S.A.B. DE C.V.         |AV. FERROCARRIL DE RIO FRIO NO. 419 CJ|CIUDAD DE MEXICO      |CIUDAD DE MEXICO                 |MEXICO                   |009100                |000                     |CDMX                                     |489                      |09310                   |5                          |04316                |008037671|03           |       |       |WWW.ADVANCEAMERICA.NET  |522291|      |      |      |      |      |                      |N          |+37.006016|-085.924526|                                                                                                                                                                                                                                         |                |                |                |                |N             |N              |61410000        |PERSONAL CREDIT INSTITUTIONS                                |CONSUMER LENDING                                               |          |C                     |
|       |001134456 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |                            |N                         |126 DANIEL ST            |                      |PORTSMOUTH      |NEW HAMPSHIRE              |USA                |006885          |725               |057                        |NH                                 |805                |038013857         |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |               |          |          |BARBARA CONDA            |MANAGER                   |NATIONAL COMMERCIAL BANKS, NSK           |6021|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000015      |0                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |072147077                    |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |850 MAIN ST FL 6                |BRIDGEPORT                 |CONNECTICUT                           |USA                           |000677                     |112                          |CT                                            |805                           |066044917                    |6                               |       |800407673                  |PEOPLE'S UNITED FINANCIAL, INC.       
|850 MAIN ST                          |BRIDGEPORT              |CONNECTICUT                        |000677                  |805                        |CT                                         |066044917                 |N                      |       |800407673               |PEOPLE'S UNITED FINANCIAL, INC.       |850 MAIN ST                           |BRIDGEPORT            |CONNECTICUT                      |USA                      |000677                |112                     |CT                                       |805                      |066044917               |6                          |00583                |014029370|02           |       |       |WWW.BRANCHES.PEOPLES.COM|522110|      |      |      |      |      |                      |N          |+43.077690|-070.755372|                                                                                                                                                                                                                                         |                |                |                |                |P             |N              |60210000        |NATIONAL COMMERCIAL BANKS                                   |COMMERCIAL BANKING                                             |          |C                     |
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+

The reference tables are very small, no more than 10 MB.

Table 2

+------------+------------+------------+-------------+---------+--------------+
|COUNTRY_CODE|COUNTRY_NAME|PROVINCE    |PROVINCE_CODE|TOWN_CODE|TOWN_NAME     |
+------------+------------+------------+-------------+---------+--------------+
|021         |ANDORRA     |null        |000          |000002   |ALDOSA        |
|021         |ANDORRA     |null        |000          |000013   |EL TARTER     |
|033         |ARGENTINA   |BUENOS AIRES|001          |000223   |OLIVOS        |
|033         |ARGENTINA   |BUENOS AIRES|001          |000226   |PABLO PODESTA |
+------------+------------+------------+-------------+---------+--------------+

Table 3

+------+--------+-----------+---------+
|CityID|TownCode|CountryCode|StateCode|   
+------+--------+-----------+---------+
|110880|006129  |805        |001      |
|110888|007554  |805        |005      |
|111164|004661  |805        |009      |
|111368|005193  |805        |075      |
+------+--------+-----------+---------+

Table 4: Identifiers

+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|IdentifierTypeId|Value |EntityId  |ValueTypeId|EffectiveFrom       |ProviderId|ProviderType|SourceUpdateDate|SourceLink|SourceType|EffectiveToNACode|EffectiveToMinus|EffectiveTo           |EffectiveFromNACode|EffectiveFromPlus|NaCode|IsPrimary|ValueOrder|ValueTypeCode|EntityType|EntityTypeId|SysFrom             |SysFileId           |
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|320114          |3339  |4294963171|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |333997|4294963154|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |333999|4294963153|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |334   |4294963152|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+

1 Answer:

Answer 0 (score: 0)

Yes, you can do this in Spark. There are two approaches:

  1. Broadcast the small tables, then filter the large table against them using filter or where.
  2. Perform a broadcast join.

Here is a basic example of the first approach.

    filter
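A minimal sketch of the first approach, not the answer's original code: the keys of one small reference table are collected to the driver, shipped to the executors as a broadcast variable, and used inside a filter on the large table. The object name `BroadcastFilterSketch`, the helper `invalidCountryRows`, the choice of `physicalcountrycode` against `COUNTRY_CODE`, and the sample rows in `main` are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, not, udf}

object BroadcastFilterSketch {
  // Hypothetical helper: returns the rows of mainDf whose physicalcountrycode
  // does not appear in the COUNTRY_CODE column of the small reference table.
  def invalidCountryRows(spark: SparkSession, mainDf: DataFrame, table2: DataFrame): DataFrame = {
    import spark.implicits._
    // The reference table is small, so collecting its keys to the driver is cheap.
    val codes: Set[String] =
      table2.select("COUNTRY_CODE").distinct().as[String].collect().toSet
    // Ship one read-only copy of the key set to every executor.
    val validCodes = spark.sparkContext.broadcast(codes)
    // Membership test against the broadcast set; a null code counts as invalid.
    val isValid = udf((code: String) => code != null && validCodes.value.contains(code))
    mainDf.filter(not(isValid(col("physicalcountrycode"))))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the real tables.
    val mainDf = Seq(("001007108", "805"), ("000000001", "999"))
      .toDF("dunsnumber", "physicalcountrycode")
    val table2 = Seq(("805", "USA"), ("033", "ARGENTINA"))
      .toDF("COUNTRY_CODE", "COUNTRY_NAME")

    // Only the row with the unknown country code should remain.
    invalidCountryRows(spark, mainDf, table2).show(false)
    spark.stop()
  }
}
```

This style suits single-key existence checks. A lookup that has to return a value (such as TownCode) would need a broadcast Map instead of a Set, at which point a broadcast join is usually simpler.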

    Edit (adding a broadcast join example):

    where
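A sketch of how the CityId validation from the question could be written as two broadcast joins; this is not the answer's original code. It assumes that `physicalcountrycode`, `physicalstateorprovincecode` and `physicalcityname` correspond to `COUNTRY_CODE`, `PROVINCE_CODE` and `TOWN_NAME` in Table 2, and that `TOWN_CODE` in Table 2 matches `TownCode` in Table 3. The object name `BroadcastJoinSketch`, the helper `withCityIdCheck`, the output column `cityIdValid` and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object BroadcastJoinSketch {
  // Hypothetical helper: adds a boolean column cityIdValid to the main table.
  def withCityIdCheck(mainDf: DataFrame, table2: DataFrame, table3: DataFrame): DataFrame = {
    // Keep only the lookup columns so the broadcast payload stays small.
    val towns  = table2.select("COUNTRY_CODE", "PROVINCE_CODE", "TOWN_NAME", "TOWN_CODE")
    val cities = table3.select("CountryCode", "StateCode", "TownCode", "CityID")

    // Step 1: resolve TOWN_CODE from country, state/province and city name.
    val withTown = mainDf.join(
      broadcast(towns),
      mainDf("physicalcountrycode") === towns("COUNTRY_CODE") &&
        mainDf("physicalstateorprovincecode") === towns("PROVINCE_CODE") &&
        mainDf("physicalcityname") === towns("TOWN_NAME"),
      "left")

    // Step 2: resolve CityID from country, state/province and the town code.
    val withCity = withTown.join(
      broadcast(cities),
      withTown("physicalcountrycode") === cities("CountryCode") &&
        withTown("physicalstateorprovincecode") === cities("StateCode") &&
        withTown("TOWN_CODE") === cities("TownCode"),
      "left")

    // A row passes when a CityID was found; the helper lookup columns are dropped.
    withCity
      .withColumn("cityIdValid", col("CityID").isNotNull)
      .drop("COUNTRY_CODE", "PROVINCE_CODE", "TOWN_NAME", "CountryCode", "StateCode", "TownCode")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the real tables.
    val mainDf = Seq(
      ("001007108", "805", "019", "UNADILLA"),
      ("000000001", "805", "033", "NOWHERE")
    ).toDF("dunsnumber", "physicalcountrycode", "physicalstateorprovincecode", "physicalcityname")
    val table2 = Seq(("805", "019", "UNADILLA", "008857"))
      .toDF("COUNTRY_CODE", "PROVINCE_CODE", "TOWN_NAME", "TOWN_CODE")
    val table3 = Seq(("110880", "008857", "805", "019"))
      .toDF("CityID", "TownCode", "CountryCode", "StateCode")

    withCityIdCheck(mainDf, table2, table3).show(false)
    spark.stop()
  }
}
```

The left joins keep every row of the main table, so failed lookups show up as `cityIdValid = false` rather than disappearing. If a reference table contains duplicate keys, the join would multiply rows, so deduplicating the lookup tables first may be needed. Spark can also choose a broadcast join on its own for tables below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); the explicit `broadcast` hint just makes the intent visible.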