How to find similar rows by matching column values in Spark?

Asked: 2020-05-14 11:47:23

Tags: scala apache-spark apache-spark-sql

I have a dataset like this:

{"customer":"customer-1","attributes":{"att-a":"att-a-7","att-b":"att-b-3","att-c":"att-c-10","att-d":"att-d-10","att-e":"att-e-15","att-f":"att-f-11","att-g":"att-g-2","att-h":"att-h-7","att-i":"att-i-5","att-j":"att-j-14"}}
{"customer":"customer-2","attributes":{"att-a":"att-a-9","att-b":"att-b-7","att-c":"att-c-12","att-d":"att-d-4","att-e":"att-e-10","att-f":"att-f-4","att-g":"att-g-13","att-h":"att-h-4","att-i":"att-i-1","att-j":"att-j-13"}}
{"customer":"customer-3","attributes":{"att-a":"att-a-10","att-b":"att-b-6","att-c":"att-c-1","att-d":"att-d-1","att-e":"att-e-13","att-f":"att-f-12","att-g":"att-g-9","att-h":"att-h-6","att-i":"att-i-7","att-j":"att-j-4"}}
{"customer":"customer-4","attributes":{"att-a":"att-a-9","att-b":"att-b-14","att-c":"att-c-7","att-d":"att-d-4","att-e":"att-e-8","att-f":"att-f-7","att-g":"att-g-14","att-h":"att-h-9","att-i":"att-i-13","att-j":"att-j-3"}}

I have flattened the data into a DataFrame like this:

+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
|   att-a|   att-b|   att-c|   att-d|   att-e|   att-f|   att-g|   att-h|   att-i|   att-j|   customer|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a-7| att-b-3|att-c-10|att-d-10|att-e-15|att-f-11| att-g-2| att-h-7| att-i-5|att-j-14| customer-1|
| att-a-9| att-b-7|att-c-12| att-d-4|att-e-10| att-f-4|att-g-13| att-h-4| att-i-1|att-j-13| customer-2|
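
For readers who want to reproduce the flattened frame, here is a minimal sketch of one way to get it. It assumes the JSON lines above are read with `spark.read.json`, so that `attributes` is inferred as a struct; the object name `FlattenSketch`, the helper `flatten`, and the path `customers.json` are hypothetical and not from my actual code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FlattenSketch {
  // Expands the nested "attributes" struct into top-level columns
  // and keeps "customer" as the last column.
  def flatten(raw: DataFrame): DataFrame =
    raw.select(col("attributes.*"), col("customer"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-sketch")
      .getOrCreate()

    // Hypothetical path: one JSON object per line, as in the sample above.
    val raw = spark.read.json("customers.json")
    flatten(raw).show()

    spark.stop()
  }
}
```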

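I want to complete the compareColumns function. It should compare the columns of two DataFrames (userDF and flattenedDF) and return a new DataFrame like the sample output below.

How can I do this? My idea is to compare every row of flattenedDF with userDF column by column, and increment a counter whenever the values match. Only matching columns are compared, for example att-a with att-a and att-b with att-b.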

  def getCustomer(customerID: String)(dataFrame: DataFrame): DataFrame = {
    dataFrame.filter($"customer" === customerID).toDF()
  }

  def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
    val userDF = dataFrame.transform(getCustomer(customerID))
    userDF.printSchema()
    userDF
  }

Sample output:

+-------------+------------------+
| customer    | similarity_score |
+-------------+------------------+
| customer-1  | -1               |  same as the reference customer, so it gets -1 and is ignored
| customer-12 |  2               |
| customer-3  |  2               |
| customer-44 |  5               |
| customer-5  |  1               |
| customer-6  | 10               |
+-------------+------------------+
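
The counting idea described above could be sketched roughly as follows. This is only one possible approach, not a confirmed solution. It assumes the reference customer appears exactly once in the flattened frame and that every column other than `customer` is an attribute to compare. The object name `SimilaritySketch` is hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

object SimilaritySketch {
  // For every row, counts how many attribute columns hold the same value
  // as the reference customer's row. The reference row itself gets -1.
  def compareColumns(customerID: String)(flattenedDF: DataFrame): DataFrame = {
    // Assumption: every column except "customer" is an attribute.
    val attrCols = flattenedDF.columns.filter(_ != "customer")

    // Assumption: customerID occurs exactly once; head() throws if it is missing.
    val reference = flattenedDF.filter(col("customer") === customerID).head()

    // One 0/1 flag per attribute column, summed into a single score column.
    val matches = attrCols
      .map { name =>
        when(flattenedDF(name) === reference.getAs[String](name), 1).otherwise(0)
      }
      .foldLeft(lit(0))(_ + _)

    flattenedDF.select(
      col("customer"),
      when(col("customer") === customerID, lit(-1))
        .otherwise(matches)
        .as("similarity_score")
    )
  }
}
```

It would be called as `flattenedDF.transform(SimilaritySketch.compareColumns("customer-2"))`. Pulling the single reference row to the driver keeps the comparison a plain column expression, so no join between userDF and flattenedDF is needed.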

Thanks

0 answers