PySpark - is there a way to join two dataframes horizontally so that each row in the first df has all the rows from the second df

Asked: 2018-09-08 18:29:28

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

So I have a df of users with unique user_ids and another df with a set of questions. I then want to merge the two dfs so that each user_id is attached to the full set of questions:

Users Df:

+--------------------------+
|user_id                   |
+--------------------------+
|GDDVWWIOOKDY4WWBCICM4VOQHQ|
|77VC23NYEWLGHVVS4UMHJEVESU|
|VCOX7HUHTMPFCUOGYWGL4DMIRI|
|XPJBJMABYXLTZCKSONJVBCOXQM|
|QHTPQSFNOA5YEWH6N7FREBMMDM|
|JLQNBYCSC4DGCOHNLRBK5UANWI|
|RWYUOLBKIQMZVYHZJYCQ7SGTKA|
|CR33NGPK2GKK6G35SLZB7TGIJE|
|N6K7URSGH65T5UT6PZHMN62E2U|
|SZMPG3FQQOHGDV23UVXODTQETE|
+--------------------------+

Questions Df:

+--------------------+-------------------+-----------------+--------------------+
|       category_type|   category_subject|      question_id|            question|
+--------------------+-------------------+-----------------+--------------------+
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|        Demographics|Social Demographics|pdl_ethnicity_new|           Ethnicity|
|        Demographics|Social Demographics|pdl_ethnicity_new|           Ethnicity|
|        Demographics|Social Demographics|pdl_ethnicity_new|           Ethnicity|
+--------------------+-------------------+-----------------+--------------------+

Currently, I convert the user_ids into a list and loop over them, creating a new column on the questions df for each one and building a temporary df from the result. I then union that into a final df that accumulates the output of each user_id iteration, as follows:

Create the list of user_ids:

from pyspark.sql import functions as f

# Collect all user_ids into a plain Python list on the driver
unique_users_list = users_df \
  .select("user_id") \
  .agg(f.collect_list('user_id')).collect()[0][0]

Create an empty final df to append to:

from pyspark.sql.types import StructType, StructField, StringType

finaldf_schema = StructType([
    StructField("category_type", StringType(), False),
    StructField("category_subject", StringType(), False),
    StructField("question_id", StringType(), False),
    StructField("question", StringType(), False),
    StructField("user_id", StringType(), False)
])

final_df = spark.createDataFrame([], finaldf_schema)

Then loop over the user_ids and merge each iteration with the questions df:

for user_id in unique_users_list:
    # Tag every question row with this user_id, then append it to the running result
    temp_df = questions_df.withColumn("user_id", f.lit(user_id))
    final_df = final_df.union(temp_df)

However, I am finding the performance to be very slow. Is there a more efficient, faster way of doing this, please?

Thanks

1 answer:

Answer 0: (score: 1)

What you are looking for is called a Cartesian product. You can achieve it with pyspark.sql.DataFrame.crossJoin():

Try:

final_df = users_df.crossJoin(questions_df)
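
This single cross join replaces the whole collect/loop/union approach, so there is no Python-side iteration and no long union lineage to slow things down. If you also want the columns in the same order as the finaldf_schema built in the question, you can add a select afterwards; a minimal sketch, assuming users_df and questions_df are the dataframes shown above:

final_df = users_df.crossJoin(questions_df) \
    .select("category_type", "category_subject", "question_id", "question", "user_id")

# Optional sanity check (triggers a full count of both inputs):
# the result should contain users_df.count() * questions_df.count() rows.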