Generate all possible combinations of two columns plus an indicator showing whether each combination exists in the source table

Time: 2018-04-25 14:09:55

Tags: pyspark pyspark-sql

I am completely stuck at a particular stage of a transformation.

I plan to implement it using SQL or pyspark.

My input has the following format:

id  name
1   A
1   C
1   E
2   A
2   B
2   C
2   E
2   F
3   A
3   E
3   D

Could you help me produce output in the following format?

id name rating
1  A    1
1  B    0
1  C    1
1  D    0
1  E    1
1  F    0
2  A    1
2  B    1
2  C    1
2  D    0
2  E    1
2  F    1
3  A    1
3  B    0
3  C    0
3  D    1
3  E    1
3  F    0

Since the SQL query takes forever, I just want to see whether I can achieve the same thing with pyspark, so that I can feed the dataset to ALS.

In other words, how do I generate all possible combinations of id and name, and set the rating to 1 if the combination exists in the table and 0 otherwise?

2 answers:

Answer 0 (score: 4)

> In other words, how do I generate all possible combinations of id and name, and set the rating to 1 if the combination exists in the table and 0 otherwise?

You need to CROSS JOIN two derived tables to produce every possible combination of id and name.

Query

SELECT
  *
FROM (
  SELECT
    *
  FROM (
    SELECT DISTINCT
      id
    FROM
      Table1
  ) AS distinct_id
  CROSS JOIN (
    SELECT DISTINCT
      name
    FROM
      Table1
  ) AS distinct_name
) AS table_combination
ORDER BY
    id ASC
  , name ASC

Result

| id | name |
|----|------|
|  1 |    A |
|  1 |    B |
|  1 |    C |
|  1 |    D |
|  1 |    E |
|  1 |    F |
|  2 |    A |
|  2 |    B |
|  2 |    C |
|  2 |    D |
|  2 |    E |
|  2 |    F |
|  3 |    A |
|  3 |    B |
|  3 |    C |
|  3 |    D |
|  3 |    E |
|  3 |    F |

See the demo: http://sqlfiddle.com/#!9/ba5f17/17
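
For reference, here is the same cross join written with the pyspark DataFrame API (a minimal sketch, assuming the input table is already loaded as a DataFrame called df):

# Distinct values of each column, then every (id, name) pair via a cross join.
distinct_ids = df.select("id").distinct()
distinct_names = df.select("name").distinct()

combinations = distinct_ids.crossJoin(distinct_names).orderBy("id", "name")
combinations.show()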

Now we can combine a LEFT JOIN with CASE WHEN column IS NULL ... END to check whether each combination exists in the current table or was only generated.

Query

SELECT
   Table_combination.id
 , Table_combination.name
 , (
     CASE
      WHEN Table1.id IS NULL
      THEN 0
      ELSE 1
     END
   ) AS rating
FROM (
  SELECT
    *
  FROM (
    SELECT DISTINCT
      id
    FROM
      Table1
  ) AS distinct_id
  CROSS JOIN (
    SELECT DISTINCT
      name
    FROM
      Table1
  ) AS distinct_name
) AS Table_combination
LEFT JOIN
  Table1
ON
   Table_combination.id = Table1.id
 AND
   Table_combination.name = Table1.name
ORDER BY
   Table_combination.id ASC
 , Table_combination.name ASC

Result

| id | name | rating |
|----|------|--------|
|  1 |    A |      1 |
|  1 |    B |      0 |
|  1 |    C |      1 |
|  1 |    D |      0 |
|  1 |    E |      1 |
|  1 |    F |      0 |
|  2 |    A |      1 |
|  2 |    B |      1 |
|  2 |    C |      1 |
|  2 |    D |      0 |
|  2 |    E |      1 |
|  2 |    F |      1 |
|  3 |    A |      1 |
|  3 |    B |      0 |
|  3 |    C |      0 |
|  3 |    D |      1 |
|  3 |    E |      1 |
|  3 |    F |      0 |

See the demo: http://sqlfiddle.com/#!9/ba5f17/13
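
The same logic can also be expressed with the pyspark DataFrame API instead of SQL. A sketch, assuming the input id/name pairs are in a DataFrame called df:

from pyspark.sql import functions as F

# All possible (id, name) pairs.
distinct_ids = df.select("id").distinct()
distinct_names = df.select("name").distinct()
combinations = distinct_ids.crossJoin(distinct_names)

# Flag existing pairs with 1; pairs missing from the source come back NULL and are filled with 0.
existing = df.select("id", "name").distinct().withColumn("rating", F.lit(1))
result = (combinations
          .join(existing, on=["id", "name"], how="left")
          .fillna(0, subset=["rating"])
          .orderBy("id", "name"))
result.show()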

Answer 1 (score: 0)

I made a function based on Raymond Nijlands' answer:

def expand_grid(df, df_name, col_a, col_b, col_c):
    """Cross join the distinct values of col_a and col_b, keeping col_c from df
    where the pair exists and returning 0 where it does not.
    Assumes an active SparkSession is available as `spark`."""
    df.createOrReplaceTempView(df_name)
    expand_sql = f"""
        SELECT
            expanded.{col_a},
            expanded.{col_b},
            CASE
                WHEN {df_name}.{col_c} IS NULL THEN 0
                ELSE {df_name}.{col_c}
            END AS {col_c}
        FROM (
            SELECT *
            FROM (
                SELECT DISTINCT {col_a}
                FROM {df_name}
            ) AS {col_a}s
            CROSS JOIN (
                SELECT DISTINCT {col_b}
                FROM {df_name}
            ) AS {col_b}s
        ) AS expanded
        LEFT JOIN {df_name}
        ON expanded.{col_a} = {df_name}.{col_a}
        AND expanded.{col_b} = {df_name}.{col_b}
    """
    print(expand_sql)
    result = spark.sql(expand_sql)
    return result

Usage for this question (note that the generated SQL reads the rating column from the source table, so for this question's data a constant rating of 1 needs to be added first; see the sketch after the call):

expand_grid(df=df, df_name="df_name", col_a="id", col_b="name", col_c="rating")
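
A minimal sketch of preparing the input and calling the function for this question's data, assuming df is the DataFrame with the id/name pairs and an active SparkSession is bound to spark:

from pyspark.sql import functions as F

# The source table only has id and name, so add a constant rating of 1 first;
# the generated CASE expression then returns it for existing pairs and 0 otherwise.
df_with_rating = df.withColumn("rating", F.lit(1))

result = expand_grid(df=df_with_rating, df_name="df_name",
                     col_a="id", col_b="name", col_c="rating")
result.show()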