How to select * from one side of a PySpark join
impression_rdd.join(
click_rdd,
impression_rdd.session_id == click_rdd.session_id,
"left_outer"
).select(impression_rdd.*) <------- pseudo code; how do you do this?
Basically, the SQL equivalent of:
SELECT impression.* FROM impression LEFT JOIN click on (impression.session_id = click.session_id)
Answer 0 (score: 2)
You can simply add an alias and a couple of quotes to your pseudocode:
(impressions.alias("impressions")
.join(clicks, ["id"], "left_outer")
.select("impressions.*"))
Answer 1 (score: 1)
Two more constructs equivalent to zero323's answer:
(impressions.join(clicks, 'session_id', 'left_outer')
.select(*impressions.columns))
If the right-hand table contributes only one extra column, say 'count', dropping it may be more readable:
(impressions.join(clicks, 'session_id', 'left_outer')
.drop('count'))