I have the following JSON events coming from different streaming topics:
customer= [
{
"Customer_id": "103",
"Customer_name": "Hari",
"email_address": "hari@gmail.com"
}]
product = [
{
"Customer_id": "103",
"product_id": " 205",
"product_name": "Books",
"product_Category": "Stationary"
}]
Sales= [
{
"customer_id": "103",
"line": {
"product_id": "205",
"purchase_time": "2017-08-19 12:17:55-0400",
"quantity": "2",
"unit_price": "25000"
},
"shipping_address": "Chennai"
}]
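Note that a sales event is nested (the product fields sit under "line") and spells its key "customer_id", while the other two topics use "Customer_id"; the product record also carries a leading space in " 205". A minimal sketch of flattening one sales event in plain Python follows. `flatten_sale` is a hypothetical helper name used only for illustration, not something from my actual code.

```python
def flatten_sale(event):
    """Lift the nested "line" fields of one sales event to the top level.

    Hypothetical helper for illustration only.
    """
    # Copy every top-level field except the nested "line" object.
    flat = {k: v for k, v in event.items() if k != "line"}
    # Merge the fields of "line" (product_id, purchase_time, ...) into the top level.
    flat.update(event.get("line", {}))
    # The sales events spell the key "customer_id"; the other topics use "Customer_id".
    if "customer_id" in flat:
        flat["Customer_id"] = flat.pop("customer_id")
    # Strip stray whitespace from string values, such as the leading space in " 205".
    return {k: v.strip() if isinstance(v, str) else v for k, v in flat.items()}


sale = {
    "customer_id": "103",
    "line": {
        "product_id": "205",
        "purchase_time": "2017-08-19 12:17:55-0400",
        "quantity": "2",
        "unit_price": "25000",
    },
    "shipping_address": "Chennai",
}
flat_sale = flatten_sale(sale)
print(flat_sale)
```

After this step every record is a flat dict keyed by "Customer_id", which is the shape the flat schema further down expects.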
In the use case below, each of the streams above comes from a separate topic.
I created the DataFrames below for this use case:
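from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Assumption: `session` is an ordinary SparkSession (its creation is not shown in the original post).
session = SparkSession.builder.getOrCreate()

# One wide, flat schema covering the customer, product and sales fields.
sales_schema = StructType([
    StructField("Customer_id", StringType(), True),
    StructField("Customer_name", StringType(), True),
    StructField("email_address", StringType(), True),
    StructField("product_Category", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("purchase_time", StringType(), True),
    StructField("quantity", StringType(), True),
    StructField("unit_price", StringType(), True),
    StructField("shipping_address", StringType(), True)
])

# Customer records loaded with the wide schema; fields missing from the records come out as null.
cus_Topic = session.sparkContext.parallelize(customer)
sales_df = session.createDataFrame(cus_Topic, sales_schema)

# Product records; here the schema is inferred from the dict keys.
product_topic = session.sparkContext.parallelize(product)
productDF = session.createDataFrame(product_topic)

sales_df.show()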
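+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
|Customer_id|Customer_name|   email_address|product_Category|product_id|product_name|purchase_time|quantity|unit_price|shipping_address|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+
|        103|         Hari|  hari@gmail.com|            null|      null|        null|         null|    null|      null|            null|
|        104|        Umesh|Umesh3@gmail.com|            null|      null|        null|         null|    null|      null|            null|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+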
productDF.show()
+-----------+----------------+----------+------------+
|Customer_id|product_Category|product_id|product_name|
+-----------+----------------+----------+------------+
|        103|      Stationary|       205|       Books|
|        104|     Electronics|       206|      Mobile|
+-----------+----------------+----------+------------+
Now I want to merge these DataFrames on customer_id:
product_search_DF = sales_df.join(productDF, [sales_df.Customer_id==productDF.Customer_id], 'left_outer')
product_search_DF.show()
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
|Customer_id|Customer_name|   email_address|product_Category|product_id|product_name|purchase_time|quantity|unit_price|shipping_address|Customer_id|product_Category|product_id|product_name|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
|        104|        Umesh|Umesh3@gmail.com|            null|      null|        null|         null|    null|      null|            null|        104|     Electronics|       206|      Mobile|
|        103|         Hari|  hari@gmail.com|            null|      null|        null|         null|    null|      null|            null|        103|      Stationary|       205|       Books|
+-----------+-------------+----------------+----------------+----------+------------+-------------+--------+----------+----------------+-----------+----------------+----------+------------+
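But this duplicates the columns: Customer_id, product_Category, product_id and product_name each appear twice, once from each side of the join.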
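Also, what I ultimately want is for all of this data, coming from the customer, product and sales real-time streaming topics, to be merged into a single DataFrame.
I would also like to know the right way to achieve this.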
Any help is appreciated. Thanks.