Question

I want to update table "a" with columns taken from two other tables. It would look like this MySQL UPDATE statement:

   UPDATE bucket_summary a,geo_count b, geo_state c
   SET a.category_name=b.county_name,
   a.state_code=c.state_code
   WHERE a.category_id=b.county_geoid
   AND b.state_fips=c.state_fips
   AND a.category='county'

How do I write this in PySpark?

  condition = [a.category_id=b.county_geoid, b.state_fips=c.state_fips, a.category='county']
  df_a = df_a.join([df_b, df_c], condition, how= left)

This does not work for me.

Answer 1

Hope this helps!

import pyspark.sql.functions as f

# `sc` below is the SparkContext that the pyspark shell predefines

########
# data
########
df_a = sc.parallelize([
    [None,  None,  '123', 'country'],
    ['sc2', 'cn2', '234', 'state'],
    ['sc3', 'cn3', '456', 'country']
]).toDF(('state_code', 'category_name', 'category_id', 'category'))
df_a.show()

df_b = sc.parallelize([
    ['789','United States', 'asdf'],
    ['234','California',    'abc'],
    ['456','United Kingdom','xyz']
]).toDF(('county_geoid', 'country_name', 'state_fips'))

df_c = sc.parallelize([
    ['US','asdf'],
    ['CA','abc'],
    ['UK','xyz']
]).toDF(('state_code', 'state_fips'))
df_c = df_c.select(*(f.col(x).alias(x + '_df_c') for x in df_c.columns))

########
# update df_a with values from df_b & df_c
########
df_temp = df_a.join(df_b, [df_a.category_id == df_b.county_geoid, df_a.category=='country'], 'left').drop('county_geoid')
df_temp = df_temp.withColumn('category_name_new',
                   f.when(df_temp.country_name.isNull(), df_temp.category_name).
                   otherwise(df_temp.country_name)).drop('category_name','country_name').\
                   withColumnRenamed('category_name_new','category_name')
df_a = df_temp.join(df_c,[df_temp.state_fips == df_c.state_fips_df_c, df_temp.category=='country'], 'left').drop('state_fips_df_c','state_fips')
df_a = df_a.withColumn('state_code_new',
                   f.when(df_a.state_code_df_c.isNull(), df_a.state_code).
                   otherwise(df_a.state_code_df_c)).drop('state_code_df_c','state_code').\
                   withColumnRenamed('state_code_new','state_code')
df_a.show()

Original df_a:

+----------+-------------+-----------+--------+
|state_code|category_name|category_id|category|
+----------+-------------+-----------+--------+
|      null|         null|        123| country|
|       sc2|          cn2|        234|   state|
|       sc3|          cn3|        456| country|
+----------+-------------+-----------+--------+

Output, i.e. the final df_a:

+-----------+--------+--------------+----------+
|category_id|category| category_name|state_code|
+-----------+--------+--------------+----------+
|        234|   state|           cn2|       sc2|
|        123| country|          null|      null|
|        456| country|United Kingdom|        UK|
+-----------+--------+--------------+----------+

Answer 2

You have to perform two separate joins, and a.category == 'county' cannot be part of the join condition; apply it as a filter instead.

df_a.filter(df_a.category == 'county').join(df_b, df_a.category_id == df_b.county_geoid, "leftouter").join(df_c, 'state_fips', 'leftouter')

How do I join 3 tables with conditions in PySpark? (multiple tables)
