Spark Window performance issue

Date: 2018-10-09 18:32:52

Tags: scala performance apache-spark apache-spark-sql

I'm running into a Spark performance problem on AWS EMR, using window functions to aggregate data on a 6M-row dataset.

+--------------------+-------------------+---------------+--------------------+-------------------+----------+--------------+---------+
|        employerName|employerLegalStatus|employerAddress|           _metaData|           naicCode| _jobTitle|_worksiteState|_uniqueId|
+--------------------+-------------------+---------------+--------------------+-------------------+----------+--------------+---------+
| Advanced Technology|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title1|           ST1|        1|
| Advanced Technology|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[2234, 1, code 2,,]|Job Title1|           ST2|        2|
| Advanced Technology|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title2|           ST1|        3|
| Advanced Technology|               Inc.|     [state ->]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title1|           ST3|        4|

I'm trying to deduplicate the data by merging "similar" records.

In one main iteration I find all related records that match each other, and point each of them at the most recent record by adding a _parentId field (the _uniqueId of that most recent record).
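A minimal, self-contained sketch of that step, assuming (purely for illustration; my real matching logic is more involved) that records match on employerName alone and that "most recent" means the lowest _uniqueId, as in the sample below:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ParentIdSketch {
  // Returns _uniqueId -> _parentId for a toy dataset.
  def run(): Map[Long, Long] = {
    val spark = SparkSession.builder().master("local[2]").appName("parent-sketch").getOrCreate()
    import spark.implicits._

    // toy stand-in for the real dataset
    val df = Seq(
      (1L, "Advanced Technology"),
      (2L, "Advanced Technology"),
      (9L, "Imerys Clays"),
      (10L, "Imerys Clays")
    ).toDF("_uniqueId", "employerName")

    // every record in a match group points at the group's "most recent" record
    val parentWindow = Window.partitionBy($"employerName")
    val withParent = df.withColumn("_parentId", min($"_uniqueId").over(parentWindow))

    val out = withParent.select($"_uniqueId", $"_parentId").as[(Long, Long)].collect().toMap
    spark.stop()
    out
  }

  def main(args: Array[String]): Unit = println(run())
}
```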

+--------------------+-------------------+---------------+--------------------+-------------------+----------+--------------+---------+---------------------+--------------------+-----------------+-------------+------------------------+----------------+------------------------+-----------------+----------------+-----------+---------------------+---------------------------+-------------+------+----------+------------------+---------+
|        employerName|employerLegalStatus|employerAddress|           _metaData|           naicCode| _jobTitle|_worksiteState|_uniqueId|__expectedCombination|                 _id|_reviewedCategory|_toBeReviewed|petitionCountPerVisaType|organizationFlag|potentialVisaSponsorship|otherEmployerName|requestTypeCount|primaryCrop|natureOfTemporaryNeed|petitionCountPerCitizenship|employerPhone|lawyer|primarySub|totalPetitionCount|_parentId|
+--------------------+-------------------+---------------+--------------------+-------------------+----------+--------------+---------+---------------------+--------------------+-----------------+-------------+------------------------+----------------+------------------------+-----------------+----------------+-----------+---------------------+---------------------------+-------------+------+----------+------------------+---------+
| Advanced Technology|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title1|           ST1|        1|                  [,]|[5bbce91bdec23c60...|             null|        false|                       1|            null|                    null|             null|               1|       null|                 null|                          1|         null|  null|      null|                 1|        1|
| Advanced Technology|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[2234, 1, code 2,,]|Job Title1|           ST2|        2|                 [1,]|[5bbce91bdec23c60...|             null|        false|                       1|            null|                    null|             null|               1|       null|                 null|                          1|         null|  null|      null|                 1|        1|
| Advanced Technology|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title2|           ST1|        3|                 [1,]|[5bbce91bdec23c60...|             null|        false|                       1|            null|                    null|             null|               1|       null|                 null|                          1|         null|  null|      null|                 1|        1|
|        Imerys Clays|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[1234, 1, code 1,,]|Job Title2|           ST1|        9|                  [,]|[5bbce91bdec23c60...|             null|        false|                       1|            null|                    null|             null|               1|       null|                 null|                          1|         null|  null|      null|                 1|        9|
|        Imerys Clays|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[3234, 1, code 3,,]|Job Title1|           ST3|       10|                 [9,]|[5bbce91bdec23c60...|             null|        false|                       1|            null|                    null|             null|               1|       null|                 null|                          1|         null|  null|      null|                 1|        9|
|        Imerys Clays|               Inc.|  [state -> NJ]|[/, test.xlsx, TE...|[3234, 1, code 3,,]|Job Title2|           ST1|        8|                 [9,]|[5bbce91bdec23c60...|             null|        false|                       1|            null|                    null|             null|               1|       null|                 null|                          1|         null|  null|      null|                 1|        9|

Next I want to group all records with the same _parentId together using a single window, because I want to keep the parent record as-is and only combine certain fields into lists. I could use groupBy, but then I would lose the ordering information and would have to explode the grouped lists back into new records.

// window over all records sharing the same parent
val mergeWindow = Window
  .partitionBy(MetaData.Field.parentId.col)

val cols = originalFields.toList.map(col)

combined
  .select(
    collect_list($"_metaData.path").over(mergeWindow).as("__metaDataPaths") ::
      collect_list('_id).over(mergeWindow).as("__ids") ::
      collect_list('_uniqueId).over(mergeWindow).as("__uniqueIds") ::
      collect_list('naicCode).over(mergeWindow).as("__naicCodes") ::
      collect_list('_jobTitle).over(mergeWindow).as("jobTitles") ::
      collect_list('_worksiteState).over(mergeWindow).as("worksiteStates") ::
      collect_list('employerAddress).over(mergeWindow).as("__otherEmployerAddress") ::
      collect_list('requestTypeCount).over(mergeWindow).as("__requestTypeCount") ::
      collect_list('petitionCountPerVisaType).over(mergeWindow).as("__petitionCountPerVisaType") ::
      collect_list('petitionCountPerCitizenship).over(mergeWindow).as("__petitionCountPerCitizenship") ::
      count('totalPetitionCount).over(mergeWindow).as("totalPetitionCount") ::
      cols: _*
  )
  // keep only the parent rows, which now carry the collected lists
  .filter('_uniqueId === '_parentId)

So I keep some fields as-is and collect others into lists (for further processing).
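For comparison, the groupBy route I mention above could be sketched like this (a toy, self-contained example on a reduced column set; the max(struct(...)) trick for recovering the parent row's scalar fields is my assumption, not what I currently run):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupBySketch {
  // Returns (parentId, parent's employerName, collected _uniqueIds) for the single toy group.
  def run(): (Long, String, Set[Long]) = {
    val spark = SparkSession.builder().master("local[2]").appName("groupby-sketch").getOrCreate()
    import spark.implicits._

    // toy stand-in for `combined`, already carrying _parentId
    val combined = Seq(
      (1L, 1L, "Advanced Technology", "Job Title1", "ST1"),
      (2L, 1L, "Advanced Technology", "Job Title1", "ST2"),
      (3L, 1L, "Advanced Technology", "Job Title2", "ST1")
    ).toDF("_uniqueId", "_parentId", "employerName", "_jobTitle", "_worksiteState")

    val grouped = combined
      .groupBy($"_parentId")
      .agg(
        collect_list($"_uniqueId").as("__uniqueIds"),
        collect_list($"_jobTitle").as("jobTitles"),
        collect_list($"_worksiteState").as("worksiteStates"),
        // recover the parent row's scalar fields: the struct whose leading
        // (_uniqueId === _parentId) flag is true sorts last, so max() picks it
        max(struct(($"_uniqueId" === $"_parentId").as("isParent"), $"employerName")).as("parent")
      )

    val row = grouped.select($"_parentId", $"parent.employerName", $"__uniqueIds").head()
    val out = (row.getLong(0), row.getString(1), row.getSeq[Long](2).toSet)
    spark.stop()
    out
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that collect_list inside groupBy gives no ordering guarantee, which is exactly the order-loss problem described above; the lists are compared as sets here for that reason.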

The physical plan looks like this:

== Physical Plan ==
*(3) Project [__metaDataPaths#4962, __ids#4964, __uniqueIds#4966, __naicCodes#4968, jobTitles#4970, worksiteStates#4972, __otherEmployerAddress#4974, __requestTypeCount#4976, __petitionCountPerVisaType#4978, __petitionCountPerCitizenship#4980, totalPetitionCount#4983L, naicCode#4495, _jobTitle#4496, __expectedCombination#4499, _id#4500, _worksiteState#4497, _parentId#4515L, _metaData#4494, _toBeReviewed#4502, _uniqueId#4498L, employerName#4491, employerAddress#4493, employerLegalStatus#4492, potentialVisaSponsorship#4944, ... 8 more fields]
+- *(3) Filter (isnotnull(_uniqueId#4498L) && (_uniqueId#4498L = _parentId#4515L))
   +- Window [collect_list(_metaData#4494.originalFilePath, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __metaDataPaths#4962, collect_list(_id#4500, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __ids#4964, collect_list(_uniqueId#4498L, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __uniqueIds#4966, collect_list(naicCode#4495, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __naicCodes#4968, collect_list(_jobTitle#4496, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS jobTitles#4970, collect_list(_worksiteState#4497, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS worksiteStates#4972, collect_list(employerAddress#4493, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __otherEmployerAddress#4974, collect_list(requestTypeCount#4507, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __requestTypeCount#4976, collect_list(petitionCountPerVisaType#4503, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __petitionCountPerVisaType#4978, collect_list(petitionCountPerCitizenship#4510, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __petitionCountPerCitizenship#4980, count(totalPetitionCount#4514) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS 
totalPetitionCount#4983L, collect_list(potentialVisaSponsorship#4505, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS potentialVisaSponsorship#4944, collect_list(_reviewedCategory#4501, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _reviewedCategory#4946, collect_list(otherEmployerName#4506, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS otherEmployerName#4958, collect_list(employerPhone#4511, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS employerPhone#4956, collect_list(lawyer#4512, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS lawyer#4960, collect_list(organizationFlag#4504, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS organizationFlag#4952, collect_list(primarySub#4513, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS primarySub#4948, collect_list(primaryCrop#4508, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS primaryCrop#4950, collect_list(natureOfTemporaryNeed#4509, 0, 0) windowspecdefinition(_parentId#4515L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS natureOfTemporaryNeed#4954], [_parentId#4515L]
      +- *(2) Sort [_parentId#4515L ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(_parentId#4515L, 1000)
            +- *(1) Project [naicCode#4495, _jobTitle#4496, __expectedCombination#4499, _id#4500, _worksiteState#4497, _parentId#4515L, _metaData#4494, _toBeReviewed#4502, _uniqueId#4498L, employerName#4491, employerAddress#4493, employerLegalStatus#4492, requestTypeCount#4507, petitionCountPerVisaType#4503, petitionCountPerCitizenship#4510, totalPetitionCount#4514, potentialVisaSponsorship#4505, _reviewedCategory#4501, otherEmployerName#4506, employerPhone#4511, lawyer#4512, organizationFlag#4504, primarySub#4513, primaryCrop#4508, natureOfTemporaryNeed#4509]
               +- *(1) Filter isnotnull(_parentId#4515L)
                  +- *(1) FileScan parquet default.combinationrule4[employerName#4491,employerLegalStatus#4492,employerAddress#4493,_metaData#4494,naicCode#4495,_jobTitle#4496,_worksiteState#4497,_uniqueId#4498L,__expectedCombination#4499,_id#4500,_reviewedCategory#4501,_toBeReviewed#4502,petitionCountPerVisaType#4503,organizationFlag#4504,potentialVisaSponsorship#4505,otherEmployerName#4506,requestTypeCount#4507,primaryCrop#4508,natureOfTemporaryNeed#4509,petitionCountPerCitizenship#4510,employerPhone#4511,lawyer#4512,primarySub#4513,totalPetitionCount#4514,_parentId#4515L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/Users/tlous/Development/Scala/importer-pipeline/spark-warehouse/combinati..., PartitionFilters: [], PushedFilters: [IsNotNull(_parentId)], ReadSchema: struct<employerName:string,employerLegalStatus:string,employerAddress:map<string,string>,_metaDat...

In the end this should result in about 1M records. But the data is skewed: some records have 10k+ matches, others have one or no match at all.

The job runs fine on a few thousand records without much skew, but on the full dataset it takes over 5 hours on a 5-node, 16-core EMR cluster. So: not good.

How can I optimize this query? Should I use groupBy instead, and if so, how would I select the most recent record out of each group while keeping all the grouped fields? Is there a smarter way?
