I'm running into a Spark performance problem on AWS EMR, using window functions to aggregate a ~6M-row dataset.
+--------------------+-------------------+---------------+--------------------+-------------------+----------+--------------+---------+
| employerName|employerLegalStatus|employerAddress| _metaData| naicCode| _jobTitle|_worksiteState|_uniqueId|
+--------------------+-------------------+---------------+--------------------+-------------------+----------+--------------+---------+
| Advanced Technology| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title1| ST1| 1|
| Advanced Technology| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[2234, 1, code 2,,]|Job Title1| ST2| 2|
| Advanced Technology| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title2| ST1| 3|
| Advanced Technology| Inc.| [state ->]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title1| ST3| 4|
I'm trying to deduplicate the data by merging "similar" records. In a main iteration I find all related records that match each other and point each one at the nearest record by adding a _parentId field, which holds the _uniqueId of that nearest record.
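The parent-assignment step isn't shown here, but as a minimal sketch (assuming, purely for illustration, that an exact match on employerName stands in for the real similarity logic), it looks something like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("parentId-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Advanced Technology", "Job Title1", "ST1", 1L),
  ("Advanced Technology", "Job Title1", "ST2", 2L),
  ("Advanced Technology", "Job Title2", "ST1", 3L),
  ("Imerys Clays",        "Job Title2", "ST1", 9L)
).toDF("employerName", "_jobTitle", "_worksiteState", "_uniqueId")

// Every record in a match group points at the group's lowest _uniqueId.
// Here the group is just employerName; the real matching logic is fuzzier.
val matchWindow = Window.partitionBy($"employerName")
val withParent = df.withColumn("_parentId", min($"_uniqueId").over(matchWindow))

withParent.show()
```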
+--------------------+-------------------+---------------+--------------------+-------------------+----------+--------------+---------+---------------------+--------------------+-----------------+-------------+------------------------+----------------+------------------------+-----------------+----------------+-----------+---------------------+---------------------------+-------------+------+----------+------------------+---------+
| employerName|employerLegalStatus|employerAddress| _metaData| naicCode| _jobTitle|_worksiteState|_uniqueId|__expectedCombination| _id|_reviewedCategory|_toBeReviewed|petitionCountPerVisaType|organizationFlag|potentialVisaSponsorship|otherEmployerName|requestTypeCount|primaryCrop|natureOfTemporaryNeed|petitionCountPerCitizenship|employerPhone|lawyer|primarySub|totalPetitionCount|_parentId|
+--------------------+-------------------+---------------+--------------------+-------------------+----------+--------------+---------+---------------------+--------------------+-----------------+-------------+------------------------+----------------+------------------------+-----------------+----------------+-----------+---------------------+---------------------------+-------------+------+----------+------------------+---------+
| Advanced Technology| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title1| ST1| 1| [,]|[5bbce91bdec23c60...| null| false| 1| null| null| null| 1| null| null| 1| null| null| null| 1| 1|
| Advanced Technology| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[2234, 1, code 2,,]|Job Title1| ST2| 2| [1,]|[5bbce91bdec23c60...| null| false| 1| null| null| null| 1| null| null| 1| null| null| null| 1| 1|
| Advanced Technology| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title2| ST1| 3| [1,]|[5bbce91bdec23c60...| null| false| 1| null| null| null| 1| null| null| 1| null| null| null| 1| 1|
| Imerys Clays| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title2| ST1| 9| [,]|[5bbce91bdec23c60...| null| false| 1| null| null| null| 1| null| null| 1| null| null| null| 1| 9|
| Imerys Clays| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[3234, 1, code 3,,]|Job Title1| ST3| 10| [9,]|[5bbce91bdec23c60...| null| false| 1| null| null| null| 1| null| null| 1| null| null| null| 1| 9|
| Imerys Clays| Inc.| [state -> NJ]|[/, test.xlsx, TE...|[3234, 1, code 3,,]|Job Title2| ST1| 8| [9,]|[5bbce91bdec23c60...| null| false| 1| null| null| null| 1| null| null| 1| null| null| null| 1| 9|
Next I want to use _parentId to group all related records into the same window, because I want to keep the parent record as-is and only combine certain fields into lists. I could use groupBy, but then I would lose the ordering information and would have to break the grouped lists of records back out into new records.
val mergeWindow = Window
  .partitionBy(MetaData.Field.parentId.col)

val cols = originalFields.toList.map(col)

combined
  .select(
    collect_list($"_metaData.path").over(mergeWindow).as("__metaDataPaths") ::
      collect_list('_id).over(mergeWindow).as("__ids") ::
      collect_list('_uniqueId).over(mergeWindow).as("__uniqueIds") ::
      collect_list('naicCode).over(mergeWindow).as("__naicCodes") ::
      collect_list('_jobTitle).over(mergeWindow).as("jobTitles") ::
      collect_list('_worksiteState).over(mergeWindow).as("worksiteStates") ::
      collect_list('employerAddress).over(mergeWindow).as("__otherEmployerAddress") ::
      collect_list('requestTypeCount).over(mergeWindow).as("__requestTypeCount") ::
      collect_list('petitionCountPerVisaType).over(mergeWindow).as("__petitionCountPerVisaType") ::
      collect_list('petitionCountPerCitizenship).over(mergeWindow).as("__petitionCountPerCitizenship") ::
      count('totalPetitionCount).over(mergeWindow).as("totalPetitionCount") ::
      cols: _*
  )
  .filter('_uniqueId === '_parentId)
So I keep some fields as-is and collect the others into lists (for further processing).
The plan looks like this:
== Physical Plan ==
*(3) Project [__metaDataPaths#4962, __ids#4964, __uniqueIds#4966, __naicCodes#4968, jobTitles#4970, worksiteStates#4972, __otherEmployerAddress#4974, __requestTypeCount#4976, __petitionCountPerVisaType#4978, __petitionCountPerCitizenship#4980, totalPetitionCount#4983L, naicCode#4495, _jobTitle#4496, __expectedCombination#4499, _id#4500, _worksiteState#4497, _parentId#4515L, _metaData#4494, _toBeReviewed#4502, _uniqueId#4498L, employerName#4491, employerAddress#4493, employerLegalStatus#4492, potentialVisaSponsorship#4944, ... 8 more fields]
+- *(3) Filter (isnotnull(_uniqueId#4498L) && (_uniqueId#4498L = _parentId#4515L))
+- Window [collect_list(_metaData#4494.originalFilePath, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __metaDataPaths#4962, collect_list(_id#4500, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __ids#4964, collect_list(_uniqueId#4498L, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __uniqueIds#4966, collect_list(naicCode#4495, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __naicCodes#4968, collect_list(_jobTitle#4496, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS jobTitles#4970, collect_list(_worksiteState#4497, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS worksiteStates#4972, collect_list(employerAddress#4493, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __otherEmployerAddress#4974, collect_list(requestTypeCount#4507, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __requestTypeCount#4976, collect_list(petitionCountPerVisaType#4503, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __petitionCountPerVisaType#4978, collect_list(petitionCountPerCitizenship#4510, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __petitionCountPerCitizenship#4980, count(totalPetitionCount#4514) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS 
totalPetitionCount#4983L, collect_list(potentialVisaSponsorship#4505, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS potentialVisaSponsorship#4944, collect_list(_reviewedCategory#4501, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _reviewedCategory#4946, collect_list(otherEmployerName#4506, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS otherEmployerName#4958, collect_list(employerPhone#4511, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS employerPhone#4956, collect_list(lawyer#4512, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS lawyer#4960, collect_list(organizationFlag#4504, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS organizationFlag#4952, collect_list(primarySub#4513, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS primarySub#4948, collect_list(primaryCrop#4508, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS primaryCrop#4950, collect_list(natureOfTemporaryNeed#4509, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS natureOfTemporaryNeed#4954], [_parentId#4515L]
+- *(2) Sort [_parentId#4515L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(_parentId#4515L, 1000)
+- *(1) Project [naicCode#4495, _jobTitle#4496, __expectedCombination#4499, _id#4500, _worksiteState#4497, _parentId#4515L, _metaData#4494, _toBeReviewed#4502, _uniqueId#4498L, employerName#4491, employerAddress#4493, employerLegalStatus#4492, requestTypeCount#4507, petitionCountPerVisaType#4503, petitionCountPerCitizenship#4510, totalPetitionCount#4514, potentialVisaSponsorship#4505, _reviewedCategory#4501, otherEmployerName#4506, employerPhone#4511, lawyer#4512, organizationFlag#4504, primarySub#4513, primaryCrop#4508, natureOfTemporaryNeed#4509]
+- *(1) Filter isnotnull(_parentId#4515L)
+- *(1) FileScan parquet default.combinationrule4[employerName#4491,employerLegalStatus#4492,employerAddress#4493,_metaData#4494,naicCode#4495,_jobTitle#4496,_worksiteState#4497,_uniqueId#4498L,__expectedCombination#4499,_id#4500,_reviewedCategory#4501,_toBeReviewed#4502,petitionCountPerVisaType#4503,organizationFlag#4504,potentialVisaSponsorship#4505,otherEmployerName#4506,requestTypeCount#4507,primaryCrop#4508,natureOfTemporaryNeed#4509,petitionCountPerCitizenship#4510,employerPhone#4511,lawyer#4512,primarySub#4513,totalPetitionCount#4514,_parentId#4515L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/Users/tlous/Development/Scala/importer-pipeline/spark-warehouse/combinati..., PartitionFilters: [], PushedFilters: [IsNotNull(_parentId)], ReadSchema: struct<employerName:string,employerLegalStatus:string,employerAddress:map<string,string>,_metaDat...
In the end this should produce about 1M records. But the data is skewed: some parents have 10k+ matches, while others have one or none at all.
The job runs fine on a few thousand records without much skew, but on the full dataset it takes more than 5 hours on a 5-node, 16-core EMR cluster. So that's not good.
How can I optimize this query? Should I use groupBy, and if so, how would I select the parent record's values for all the grouped fields? Is there a smarter way?
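For comparison, a groupBy-based version is possible if the parent's own values are carried through the aggregation, e.g. via max over a struct whose first field flags the parent row (structs compare field by field from the left, so the parent row wins). This is only a sketch on a toy schema, not the actual field list:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("groupBy-sketch").getOrCreate()
import spark.implicits._

val combined = Seq(
  (1L, 1L, "Job Title1", "ST1"),
  (2L, 1L, "Job Title1", "ST2"),
  (3L, 1L, "Job Title2", "ST1"),
  (9L, 9L, "Job Title2", "ST1")
).toDF("_uniqueId", "_parentId", "_jobTitle", "_worksiteState")

// max(struct(isParent, ...)) keeps the parent row's values: only the parent
// has isParent = 1, so its struct compares greatest within each group.
val merged = combined
  .groupBy($"_parentId")
  .agg(
    max(struct(($"_uniqueId" === $"_parentId").cast("int").as("isParent"),
               $"_jobTitle", $"_worksiteState")).as("parent"),
    collect_list($"_jobTitle").as("jobTitles"),
    collect_list($"_worksiteState").as("worksiteStates"),
    count($"_uniqueId").as("totalPetitionCount")
  )
  .select($"_parentId", $"parent._jobTitle", $"parent._worksiteState",
          $"jobTitles", $"worksiteStates", $"totalPetitionCount")

merged.show()
```

Unlike the window version, this computes each aggregate once per group instead of materializing it on every row and then filtering; heavily skewed groups still land on single tasks, though, so salting _parentId may also be worth trying.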