I am completely stuck at a particular step of a transformation.
I plan to implement it using SQL or PySpark.
My input format is:
id name
1 A
1 C
1 E
2 A
2 B
2 C
2 E
2 F
3 A
3 E
3 D
Could you help me produce the following output format?
id name rating
1 A 1
1 B 0
1 C 1
1 D 0
1 E 1
1 F 0
2 A 1
2 B 1
2 C 1
2 D 0
2 E 1
2 F 1
3 A 1
3 B 0
3 C 0
3 D 1
3 E 1
3 F 0
Since the SQL query takes forever, I just want to see whether I can achieve the same thing with PySpark, so that I can then feed the dataset into ALS.
In other words, how do I generate all possible combinations of id and name, setting rating to 1 if the combination exists in the table and 0 otherwise?
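Before reaching for SQL or PySpark, the requested transformation can be sketched in plain Python; this is only an illustration of the logic (the sample data below copies the question's input):

```python
from itertools import product

# Observed (id, name) pairs from the question's input.
pairs = {(1, "A"), (1, "C"), (1, "E"),
         (2, "A"), (2, "B"), (2, "C"), (2, "E"), (2, "F"),
         (3, "A"), (3, "E"), (3, "D")}

ids = sorted({i for i, _ in pairs})
names = sorted({n for _, n in pairs})

# Cross product of all distinct ids and names;
# rating is 1 when the pair was observed, else 0.
rows = [(i, n, 1 if (i, n) in pairs else 0)
        for i, n in product(ids, names)]
```

Here `rows` holds 18 tuples (3 ids x 6 names), 11 of them with rating 1.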
Answer 0 (score: 4)
> In other words, how to generate all possible combinations of id and name, with rating 1 if the combination exists in the table, otherwise 0?
You need to combine two derived tables with a CROSS JOIN to produce every possible id and name combination.
Query
SELECT
*
FROM (
SELECT
*
FROM (
SELECT
DISTINCT
id
FROM
Table1
) AS distinct_id
CROSS JOIN (
SELECT
DISTINCT
name
FROM
Table1
) AS distinct_name
) AS table_combination
ORDER BY
id ASC
, name ASC
Result
| id | name |
|----|------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 1 | E |
| 1 | F |
| 2 | A |
| 2 | B |
| 2 | C |
| 2 | D |
| 2 | E |
| 2 | F |
| 3 | A |
| 3 | B |
| 3 | C |
| 3 | D |
| 3 | E |
| 3 | F |
See the demo: http://sqlfiddle.com/#!9/ba5f17/17
Now we can combine a LEFT JOIN with CASE WHEN column IS NULL ... END to check whether each combination exists in the current table or was generated.
Query
SELECT
Table_combination.id
, Table_combination.name
, (
CASE
WHEN Table1.id IS NULL
THEN 0
ELSE 1
END
) AS rating
FROM (
SELECT
*
FROM (
SELECT
DISTINCT
id
FROM
Table1
) AS distinct_id
CROSS JOIN (
SELECT
DISTINCT
name
FROM
Table1
) AS distinct_name
) AS Table_combination
LEFT JOIN
Table1
ON
Table_combination.id = Table1.id
AND
Table_combination.name = Table1.name
ORDER BY
Table_combination.id ASC
, Table_combination.name ASC
Result
| id | name | rating |
|----|------|--------|
| 1 | A | 1 |
| 1 | B | 0 |
| 1 | C | 1 |
| 1 | D | 0 |
| 1 | E | 1 |
| 1 | F | 0 |
| 2 | A | 1 |
| 2 | B | 1 |
| 2 | C | 1 |
| 2 | D | 0 |
| 2 | E | 1 |
| 2 | F | 1 |
| 3 | A | 1 |
| 3 | B | 0 |
| 3 | C | 0 |
| 3 | D | 1 |
| 3 | E | 1 |
| 3 | F | 0 |
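The answer's query pattern can be verified end to end with Python's built-in sqlite3 module (note this swaps in SQLite for the MySQL used in the fiddle; the CROSS JOIN / LEFT JOIN semantics are the same, and the data below copies the question's input):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(1, "A"), (1, "C"), (1, "E"),
     (2, "A"), (2, "B"), (2, "C"), (2, "E"), (2, "F"),
     (3, "A"), (3, "E"), (3, "D")],
)

# Same structure as the answer: CROSS JOIN of distinct ids and names,
# then LEFT JOIN back to flag which combinations actually exist.
rows = conn.execute("""
    SELECT
        tc.id,
        tc.name,
        CASE WHEN Table1.id IS NULL THEN 0 ELSE 1 END AS rating
    FROM (
        SELECT *
        FROM (SELECT DISTINCT id FROM Table1)
        CROSS JOIN (SELECT DISTINCT name FROM Table1)
    ) AS tc
    LEFT JOIN Table1
        ON tc.id = Table1.id
        AND tc.name = Table1.name
    ORDER BY tc.id ASC, tc.name ASC
""").fetchall()
```

Running this returns the 18 (id, name, rating) rows shown in the result table above.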
Answer 1 (score: 0)
I wrote a function based on Raymond Nijlands's answer:
# Assumes `spark` is an active SparkSession and that `df` already has a
# `col_c` column; the CASE falls back to 0 when the left join finds no match.
def expand_grid(df, df_name, col_a, col_b, col_c):
    # Register the DataFrame so it can be referenced from Spark SQL.
    df.createOrReplaceTempView(df_name)
    expand_sql = f"""
    SELECT
        expanded.{col_a},
        expanded.{col_b},
        CASE
            WHEN {df_name}.{col_c} IS NULL THEN 0
            ELSE {df_name}.{col_c}
        END AS {col_c}
    FROM (
        SELECT *
        FROM (
            SELECT DISTINCT {col_a}
            FROM {df_name}
        ) AS {col_a}s
        CROSS JOIN (
            SELECT DISTINCT {col_b}
            FROM {df_name}
        ) AS {col_b}s
    ) AS expanded
    LEFT JOIN {df_name}
        ON expanded.{col_a} = {df_name}.{col_a}
        AND expanded.{col_b} = {df_name}.{col_b}
    """
    print(expand_sql)
    result = spark.sql(expand_sql)
    return result
Usage for this question:
expand_grid(df=df, df_name="df_name", col_a="id", col_b="name", col_c="rating")
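The same expansion pattern (cross product of distinct keys, then a left merge to flag the missing pairs) can also be sketched with pandas, assuming pandas is available; the sample data below mirrors the question's input:

```python
import pandas as pd

# Sample data mirroring the question's input.
df = pd.DataFrame(
    {"id":   [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
     "name": ["A", "C", "E", "A", "B", "C", "E", "F", "A", "E", "D"]}
)

# Cartesian product of all distinct ids and names.
full = pd.MultiIndex.from_product(
    [sorted(df["id"].unique()), sorted(df["name"].unique())],
    names=["id", "name"],
).to_frame(index=False)

# Left-merge against the observed pairs; unmatched pairs get rating 0.
out = full.merge(df.assign(rating=1), on=["id", "name"], how="left")
out["rating"] = out["rating"].fillna(0).astype(int)
```

`out` then contains the same 18 rows as the accepted answer's result table.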