PySpark - How to merge multiple JSON streams into a single dataframe based on a condition?

Time: 2020-10-28 04:16:07

Tags: apache-spark pyspark apache-spark-sql spark-streaming pyspark-dataframes

I have the following JSON events coming from different streaming topics:

customer= [

    {
        "Customer_id": "103",
        "Customer_name": "Hari",
        "email_address": "hari@gmail.com"
    }]

product = [
  {
    "Customer_id": "103",
    "product_id": " 205",
    "product_name": "Books",
    "product_Category": "Stationary"
  }]

Sales= [
  {
    "customer_id": "103",
    "line": {
      "product_id": "205",
      "purchase_time": "2017-08-19 12:17:55-0400",
      "quantity": "2",
      "unit_price": "25000"
    },
    "shipping_address": "Chennai"
  }]

Each of the streams above comes from a separate topic, driven by the following use cases:

  1. When a user logs in to the e-commerce portal - customer.json
  2. When the user searches for a product - product.json
  3. When the user checks out a product - sales.json

I created the following DataFrames for the above use cases:

from pyspark.sql.types import StructType, StructField, StringType

# One wide schema covering the customer, product and sales fields
sales_schema = StructType([
    StructField("Customer_id", StringType(), True),
    StructField("Customer_name", StringType(), True),
    StructField("email_address", StringType(), True),
    StructField("product_Category", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("purchase_time", StringType(), True),
    StructField("quantity", StringType(), True),
    StructField("unit_price", StringType(), True),
    StructField("shipping_address", StringType(), True)
])

# customer events -> DataFrame with the wide schema
cus_Topic = session.sparkContext.parallelize(customer)
sales_df = session.createDataFrame(cus_Topic, sales_schema)

# product events -> DataFrame with an inferred schema
product_topic = session.sparkContext.parallelize(product)
productDF = session.createDataFrame(product_topic)

sales_df.show()

+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
|Customer_id|Customer_name|   email_address|product_Category|product_id|product_name|purchase_time|quantity|unit_price|shipping_address|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
|        103|         Hari|  hari@gmail.com|            null|      null|        null|         null|    null|      null|            null|
|        104|        Umesh|Umesh3@gmail.com|            null|      null|        null|         null|    null|      null|            null|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+


productDF.show()

+-----------+----------------+----------+------------+
|Customer_id|product_Category|product_id|product_name|
+-----------+----------------+----------+------------+
|        103|      Stationary|       205|       Books|
|        104|     Electronics|       206|      Mobile|
+-----------+----------------+----------+------------+

Now I want to join these dataframes on Customer_id:

product_search_DF = sales_df.join(productDF, [sales_df.Customer_id==productDF.Customer_id], 'left_outer')
product_search_DF.show()

+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
|Customer_id|Customer_name|   email_address|product_Category|product_id|product_name|purchase_time|quantity|unit_price|shipping_address|Customer_id|product_Category|product_id|product_name|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
|        104|        Umesh|Umesh3@gmail.com|            null|      null|        null|         null|    null|      null|            null|        104|     Electronics|       206|      Mobile|
|        103|         Hari|  hari@gmail.com|            null|      null|        null|         null|    null|      null|            null|        103|      Stationary|       205|       Books|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+

But this duplicates the join columns.

Also, what I would like to see is that all of this data from the real-time streaming topics - customer, product, and sales - ends up merged into a single dataframe.

I would also like to know the right way to achieve this.

Any help is appreciated. Thanks.

0 Answers:

There are no answers.