Azure Databricks: How do I inner join two DataFrames with a one-to-many relationship and select specific columns from both?

Asked: 2019-11-19 00:18:54

Tags: python azure apache-spark databricks azure-databricks

I have read the data from JSON files as follows:

import os,shutil,glob,time
from pyspark.sql.functions import trim 

#Get Data DF1
df1 = spark.read.format("json").load("/mnt/coi/df1.json")

#Get Data DF2
df2 = spark.read.format("json").load("/mnt/coi/df2.json")

I am joining the data and selecting columns from both DataFrames, but the final result is incorrect and does not contain all the data:

df = df2.join(df1,df2.Number == df1.Number,how="inner").select(df1.abc,df2.xyz)

DF1 JSON has unique Number column values:

{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0"} 
{"Number":80216884,"Type":"8","ID":2,"Code":"1010","abc":"MT"} 
{"Number":80216885,"Type":"8","ID":2,"Code":"1295","abc":"MS"} 
DF2 JSON has duplicate Number values:

{"Number":80216883,"DateTime":"2019-11-16","Year":2020,"Quarter":2,"xyz":5,"abc":"M0"}
{"Number":80216883,"DateTime":"2018-11-20","Year":2020,"Quarter":2,"xyz":5,"abc":"M0"}
{"Number":80216884,"DateTime":"2019-11-09","Year":2020,"Quarter":2,"xyz":5,"abc":"MT"}

The result I want is:

{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0","DateTime":"2019-11-16","Year":2020,"Quarter":2,"xyz":5}
{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0","DateTime":"2018-11-20","Year":2020,"Quarter":2,"xyz":5}


How do I inner join two DataFrames that have a one-to-many relationship and select specific columns from both DataFrames?

Some Number values that exist in both DFs do not appear in the final output JSON after the join.
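A quick way to see which join keys are dropping out, and whether the Number column was inferred with a different type in the two files, is to compare the schemas and run an anti-join on the key. A minimal diagnostic sketch, reusing df1 and df2 from above:

# Compare the inferred schemas; a long vs. string mismatch on Number
# would make matching rows silently fall out of the inner join.
df1.printSchema()
df2.printSchema()

# Number values present in df1 that find no match in df2
missing = df1.join(df2, df1.Number == df2.Number, how="left_anti")
missing.select("Number").show(truncate=False)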

Also, when merging the part files into a single file, only the last part file ends up in the final data. Please find the code below:

dfAll.write.format("json").save("/mnt/coi/DataModel")

#Read Part files
path = glob.glob("/dbfs/mnt/coi/DataModel/part-000*.json")


#Move part files to the FinalData folder in blob storage
for file in path: 
      shutil.move(file,"/dbfs/mnt/coi/FinalData/FinalData.json")
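The loop moves every part file to the same destination path, so each iteration overwrites the previous one and only the last part file survives. A sketch of what I think would avoid the manual merge entirely, assuming coalescing to a single partition is acceptable for this data size and the same mount paths as above:

# Writing with one partition produces exactly one part file,
# so there is nothing left to merge by hand.
dfAll.coalesce(1).write.format("json").mode("overwrite").save("/mnt/coi/DataModel")

# Move the single part file to its final name
part = glob.glob("/dbfs/mnt/coi/DataModel/part-000*.json")[0]
shutil.move(part, "/dbfs/mnt/coi/FinalData/FinalData.json")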

1 Answer:

Answer 0: (score: 1)

To get the result you expect, considering that you only want the Number values that appear more than once (the "many" side of the relationship), my approach is the following:

from pyspark.sql.functions import col

# Inner join on Number and keep the DF2 columns
df = df2.join(df1, df2.Number == df1.Number, how="inner").select(df2.DateTime, df2.Number, df2.Quarter, df2.Year, df2.abc, df2.xyz)

# Keep only the Number values that appear more than once (the "many" side)
df3 = df.groupBy("Number").count().filter(col("count") > 1).select(df.Number)

# Join back to recover the full rows for those Number values
df4 = df3.join(df, df.Number == df3.Number, how="inner")

display(df4)
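If you also need the DF1-only fields (Type, ID, Code) in the final records, you can select from both sides of the same join. A sketch assuming the field names shown in your sample JSON:

# Select columns from both DataFrames to match the desired output shape
df_both = df2.join(df1, df2.Number == df1.Number, how="inner") \
             .select(df2.Number, df1.Type, df1.ID, df1.Code, df1.abc,
                     df2.DateTime, df2.Year, df2.Quarter, df2.xyz)
display(df_both)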

Please let me know if this helps.