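I have a large table (data frame) with more than 50 million records and 135 columns. For each row, I need to run validations on more than 50 of those columns.
So basically, for each row I need to look up the corresponding values in all 25 reference tables.
I have listed only 4 small tables here, but in my case there will be 25 such tables.
For example, here is one of my validations, called CityId Validation.
To do the CityId validation, we need the TownCode from Table2, which is found by passing physicalstateorprovincecode, physicalcountrycode and physicalcityname from Table1.
Using that TownCode, I then go to Table3, passing physicalcountrycode, physicalstateorprovincecode and TownCode, to get the CityID.
If a CityID is found, the value is correct; otherwise it is wrong.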
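To make the two-step lookup concrete, here is a minimal plain-Python sketch of the CityId check, with dictionaries standing in for Table2 and Table3. The reference entries, the CityID value `999001`, and the helper name `is_city_valid` are invented for illustration; only the column names come from the tables in this post.

```python
# Stand-in for Table2: (country code, province code, town name) -> town code.
# The single entry is made up so that the first sample row finds a match.
TOWN_CODES = {("805", "019", "UNADILLA"): "008857"}

# Stand-in for Table3: (country code, state code, town code) -> CityID.
# "999001" is a hypothetical CityID, not a real value.
CITY_IDS = {("805", "019", "008857"): "999001"}


def is_city_valid(row):
    """Return True if the row's city resolves to a CityID via Table2 then Table3."""
    country = row["physicalcountrycode"]
    state = row["physicalstateorprovincecode"]
    # Step 1: Table2 lookup by country, state and city name gives the TownCode.
    town_code = TOWN_CODES.get((country, state, row["physicalcityname"]))
    if town_code is None:
        return False
    # Step 2: Table3 lookup by country, state and TownCode gives the CityID.
    return (country, state, town_code) in CITY_IDS


rows = [
    {"physicalcountrycode": "805", "physicalstateorprovincecode": "019",
     "physicalcityname": "UNADILLA"},
    {"physicalcountrycode": "805", "physicalstateorprovincecode": "033",
     "physicalcityname": "NOWHERE"},  # hypothetical row with no match
]
results = [is_city_valid(r) for r in rows]
print(results)
```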
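Below is what my data frames look like.
The logic above is an example for just one column, but I have to do this kind of validation for more than 50 columns.
Can we do this in Spark?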
Table 1: main table (50 million records)
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|filler1|dunsnumber|businessname |tradestylename |registeredaddressindicator|physicalstreetaddress |physicalstreetaddress2|physicalcityname|physicalstateorprovincename|physicalcountryname|physicalcitycode|physicalcountycode|physicalstateorprovincecode|physicalstateorprovinceabbreviation|physicalcountrycode|physicalpostalcode|physicalcontinentcode|mailingaddress|mailingcityname|mailingcountyname|mailingstateorprovincename|mailingcountryname|mailingcitycode|mailingcountycode|mailingstateorprovincecode|mailingstateorprovinceabbreviation|mailingcountrycode|mailingpostalcode|mailingcontinentcode|nationalidentificationnumber|nationalidentificationsystemcode|countrytelephoneaccesscode|telephonenumber|cabletelex|faxnumber |chiefexecutiveofficername|chiefexecutiveofficertitle|lineofbusiness |sic1|sic2|sic3|sic4|sic5|sic6|primarylocalactivitycode|activityindicator|yearstarted|annualsaleslocal |annualsalesindicator|annualsalesinusd|currencycode|employeeshere|employeeshereindicator|employeestotal|employeestotalindicator|includeprinciplesindicator|importexportagentindicator|legalstatus|filler2|statuscode|subsidiarycode|filler3|previousdunsnumber|financialstatementdate|filler4|headquarterorparentdunsnumber|headquarterorparentbusinessname |headquarterorparentstreetaddress|headquarterorparentcityname|headquarterorparentstateorprovincename|headquarterorparentcountryname|headquarterorparentcitycode|headquarterorparentcountycode|headquarterorparentstateorprovinceabbreviation|headquarterorparentcountrycode|headquarterorparentpostalcode|headquarterorparentcontinentcode|filler5|domesticultimatedunsnumbers|domesticultimatebusinessname |domesticultimatephysicalstreetaddress|domesticultimatecityname|domesticultimatestateorprovincename|domesticultimatecitycode|domesticultimatecountrycode|domesticultimatestateorprovinceabbreviation|domesticultimatepostalcode|globalultimateindicator|filler6|globalultimatedunsnumber|globalultimatebusinessname |globalultimatestreetaddress 
|globalultimatecityname|globalultimatestateorprovincename|globalultimatecountryname|globalultimatecitycode|globalultimatecountycode|globalultimatestateorprovinceabbreviation|globalultimatecountrycode|globalultimatepostalcode|globalultimatecontinentcode|numberoffamilymembers|diascode |hierarchycode|filler7|filler8|urldomain |naics1|naics2|naics3|naics4|naics5|naics6|publicprivateindicator|obindicator|latitude |longitude |oporactdescpart1 |oporactdescpart2|oporactdescpart3|oporactdescpart4|oporactdescpart5|nixieindicator|delistindicator|primary8digitsic|primary8digitdescription |primarynaicsdescription |natlidfull|transactionalindicator|
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
| |001007108 |DOLGENCORP, LLC |DOLLAR GENERAL |N |1342 PINE ST | |UNADILLA |GEORGIA |USA |008857 |296 |019 |GA |805 |31091 |6 | | | | | | |000 |000 | |000 | | | | |0001 |4786279585 | | |EVE MEADOWS |MANAGER |VARIETY STORES |5331| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000006 |1 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |068331990 |DOLGENCORP, LLC |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |USA |003754 |203 |TN |805 |370722171 |6 | |006946172 |DOLLAR GENERAL CORPORATION |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |003754 |805 |TN |370722171 |N | |006946172 |DOLLAR GENERAL CORPORATION |100 MISSION RDG |GOODLETTSVILLE |TENNESSEE |USA |003754 |203 |TN |805 |370722171 |6 |11210 |005479269|02 | | | |452319| | | | | | |N |+32.252708|-083.740074| | | | | |N |N |53310000 |VARIETY STORES |ALL OTHER GENERAL MERCHANDISE STORES | |C |
| |001132690 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|ADVANCE AMERICA |N |332 N L ROGERS WELLS BLVD| |GLASGOW |KENTUCKY |USA |003211 |060 |033 |KY |805 |421411300 |6 | | | | | | |000 |000 | |000 | | | | |0001 |2706511990 | | |LISA BROWN |MANAGER |PERSONAL CREDIT INSTITUTIONS |6141| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000002 |0 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |179469978 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|135 N CHURCH ST |SPARTANBURG |SOUTH CAROLINA |USA |008468 |839 |SC |805 |293065138 |6 | |078454395 |EAGLE U.S. SUB, INC. |135 N CHURCH ST |SPARTANBURG |SOUTH CAROLINA |008468 |805 |SC |293065138 |N | |811589639 |GRUPO ELEKTRA, S.A.B. DE C.V. |AV. FERROCARRIL DE RIO FRIO NO. 419 CJ|CIUDAD DE MEXICO |CIUDAD DE MEXICO |MEXICO |009100 |000 |CDMX |489 |09310 |5 |04316 |008037671|03 | | |WWW.ADVANCEAMERICA.NET |522291| | | | | | |N |+37.006016|-085.924526| | | | | |N |N |61410000 |PERSONAL CREDIT INSTITUTIONS |CONSUMER LENDING | |C |
| |001134456 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION | |N |126 DANIEL ST | |PORTSMOUTH |NEW HAMPSHIRE |USA |006885 |725 |057 |NH |805 |038013857 |6 | | | | | | |000 |000 | |000 | | | | |0001 | | | |BARBARA CONDA |MANAGER |NATIONAL COMMERCIAL BANKS, NSK |6021| | | | | | |000 |0000 |000000000000000000| |000000000000000 | |0000015 |0 | | |Y |G |000 | |2 |0 | |000000000 |00000000 | |072147077 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |850 MAIN ST FL 6 |BRIDGEPORT |CONNECTICUT |USA |000677 |112 |CT |805 |066044917 |6 | |800407673 |PEOPLE'S UNITED FINANCIAL, INC. |850 MAIN ST |BRIDGEPORT |CONNECTICUT |000677 |805 |CT |066044917 |N | |800407673 |PEOPLE'S UNITED FINANCIAL, INC. |850 MAIN ST |BRIDGEPORT |CONNECTICUT |USA |000677 |112 |CT |805 |066044917 |6 |00583 |014029370|02 | | |WWW.BRANCHES.PEOPLES.COM|522110| | | | | | |N |+43.077690|-070.755372| | | | | |P |N |60210000 |NATIONAL COMMERCIAL BANKS |COMMERCIAL BANKING | |C |
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
The reference tables are very small, no more than 10 MB each.
Table 2
+------------+------------+------------+-------------+---------+--------------+
|COUNTRY_CODE|COUNTRY_NAME|PROVINCE |PROVINCE_CODE|TOWN_CODE|TOWN_NAME |
+------------+------------+------------+-------------+---------+--------------+
|021 |ANDORRA |null |000 |000002 |ALDOSA |
|021 |ANDORRA |null |000 |000013 |EL TARTER |
|033 |ARGENTINA |BUENOS AIRES|001 |000223 |OLIVOS |
|033 |ARGENTINA |BUENOS AIRES|001 |000226 |PABLO PODESTA |
+------------+------------+------------+-------------+---------+--------------+
Table 3
+------+--------+-----------+---------+
|CityID|TownCode|CountryCode|StateCode|
+------+--------+-----------+---------+
|110880|006129 |805 |001 |
|110888|007554 |805 |005 |
|111164|004661 |805 |009 |
|111368|005193 |805 |075 |
+------+--------+-----------+---------+
Table 4: Identifiers
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|IdentifierTypeId|Value |EntityId |ValueTypeId|EffectiveFrom |ProviderId|ProviderType|SourceUpdateDate|SourceLink|SourceType|EffectiveToNACode|EffectiveToMinus|EffectiveTo |EffectiveFromNACode|EffectiveFromPlus|NaCode|IsPrimary|ValueOrder|ValueTypeCode|EntityType|EntityTypeId|SysFrom |SysFileId |
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|320114 |3339 |4294963171|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |333997|4294963154|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |333999|4294963153|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114 |334 |4294963152|320114 |1/1/1997 12:00:00 AM|null |null |null |null |null |NA02 |null |12/31/9999 12:00:00 AM|null |null |null |False |1 |Naics |Industry |404008 |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
Answer 0 (score: 0)
Yes, you can do this in Spark. There are two approaches:
1. broadcast the small tables, then use filter or where on the large table
2. a broadcast join
Here is a basic example of the first approach.
Edit (added a broadcast join example):