Question

Do you know of a clever way to identify the unique set of values across multiple columns in SQL?

Example input:

col_1 col_2 col_3 col_4
A     A     A     A
A     B     A     A
A     B     C     D
D     C     B     D

Desired output:

col_1 col_2 col_3 col_4  col_output
A     A     A     A      'A'
A     B     A     A      'A','B' 
A     B     C     D      'A','B','C','D'
D     C     B     D      'B','C','D'

Thanks.

Answer 1

Try using a UDF:

import org.apache.spark.sql.functions._

// Quote each value, drop duplicates (first occurrence wins), and join with commas.
val dropDuplicates = udf((arr: Seq[String]) => arr.map(x => "'" + x + "'").distinct.mkString(","))

df.withColumn("col_output", dropDuplicates(array("col_1", "col_2", "col_3", "col_4"))).show(false)

Output:

+-----+-----+-----+-----+---------------+
|col_1|col_2|col_3|col_4|col_output     |
+-----+-----+-----+-----+---------------+
|A    |A    |A    |A    |'A'            |
|A    |B    |A    |A    |'A','B'        |
|A    |B    |C    |D    |'A','B','C','D'|
|D    |C    |B    |D    |'D','C','B'    |
+-----+-----+-----+-----+---------------+
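
Note that `distinct` keeps values in the order they first appear, which is why the last row comes out as 'D','C','B' rather than the 'B','C','D' shown in the question. If the sorted order matters, one option is to sort the distinct values before joining them. Here is a minimal sketch in plain Scala (no Spark session needed); `DistinctDemo` and `quoteDistinct` are hypothetical names that only mirror the body of the UDF above:

```scala
// Hypothetical helper mirroring the UDF body, with an optional sort step.
// Uses only the Scala standard library so it can be tried without Spark.
object DistinctDemo {
  def quoteDistinct(values: Seq[String], sorted: Boolean = false): String = {
    val unique = values.distinct                      // first-occurrence order
    val ordered = if (sorted) unique.sorted else unique
    ordered.map(v => s"'$v'").mkString(",")           // quote and join
  }

  def main(args: Array[String]): Unit = {
    val row = Seq("D", "C", "B", "D")
    println(quoteDistinct(row))                 // 'D','C','B'
    println(quoteDistinct(row, sorted = true))  // 'B','C','D'
  }
}
```

Inside the UDF itself, the equivalent change would presumably be to call `.distinct.sorted` on the array before quoting and joining.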

Answer 2

You can use one giant case expression. In standard syntax:

select t.*,
       ('''' || col_1 || ''';' ||
        (case when col_2 not in (col_1) then '''' || col_2 || ''';' else '' end) ||
        (case when col_3 not in (col_1, col_2) then '''' || col_3 || ''';' else '' end) ||
        (case when col_4 not in (col_1, col_2, col_3) then '''' || col_4 || ''';' else '' end)
       ) as col_output
from t;

This actually leaves a semicolon at the end. Getting rid of it is not hard, but the best way depends on the database.
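
To make the trailing-separator behavior concrete, here is a rough Scala sketch that mirrors the logic of the case expression: each column is emitted only if its value did not appear in an earlier column, and every emitted value is followed by ';'. The names `CaseExpressionDemo`, `withTrailingSeparator`, and `trimmed` are hypothetical, and the sketch assumes no NULL values; in SQL itself the string function used for trimming differs by database.

```scala
// Hypothetical Scala mirror of the CASE-expression approach (not SQL).
object CaseExpressionDemo {
  // Emit a value only if it is absent from all earlier columns,
  // appending ';' after each emitted value as the SQL does.
  def withTrailingSeparator(cols: Seq[String]): String =
    cols.zipWithIndex.collect {
      case (v, i) if !cols.take(i).contains(v) => s"'$v';"
    }.mkString

  // Drop the single trailing ';' if one is present.
  def trimmed(cols: Seq[String]): String =
    withTrailingSeparator(cols).stripSuffix(";")

  def main(args: Array[String]): Unit = {
    val row = Seq("A", "B", "A", "A")
    println(withTrailingSeparator(row)) // 'A';'B';
    println(trimmed(row))               // 'A';'B'
  }
}
```

As with the SQL, values come out in column order, so the row D, C, B, D yields 'D';'C';'B' rather than a sorted list.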

Finding the unique set of values across columns [SQL]
