I have read the data from the JSON files as follows:
import os,shutil,glob,time
from pyspark.sql.functions import trim
#Get Data DF1
df1 = spark.read.format("json").load("/mnt/coi/df1.json")
#Get Data DF2
df2 = spark.read.format("json").load("/mnt/coi/df2.json")
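Before joining, it is worth checking that Spark inferred the same type for Number in both frames; a type mismatch (e.g. long vs. string) or stray whitespace in the key is a common reason an inner join silently drops rows:
#Verify the join key has the same inferred type in both frames
df1.printSchema()
df2.printSchema()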
I am joining the data and selecting columns from both DFs, but the final result is incorrect and is missing data:
df = df2.join(df1,df2.Number == df1.Number,how="inner").select(df1.abc,df2.xyz)
DF1 JSON has unique Number column values:
{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0"}
{"Number":80216884,"Type":"8","ID":2,"Code":"1010","abc":"MT"}
{"Number":80216885,"Type":"8","ID":2,"Code":"1295","abc":"MS"}
DF2 JSON has duplicate Number values:
{"Number":80216883,"DateTime":"2019-11-16","Year":2020,"Quarter":2,"xyz":5,"abc":"M0"}
{"Number":80216883,"DateTime":"2018-11-20","Year":2020,"Quarter":2,"xyz":5,"abc":"M0"}
{"Number":80216884,"DateTime":"2019-11-09","Year":2020,"Quarter":2,"xyz":5,"abc":"MT"}
The result I want is:
{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0","DateTime":"2018-11-16","Year":2020,"Quarter":2,"xyz":5}
{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0","DateTime":"2018-11-20","Year":2020,"Quarter":2,"xyz":5}
How do I inner-join two DataFrames that have a one-to-many relationship and select specific columns from both?
When I join, some Number values that exist in both DFs are missing from the final output JSON.
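For reference, a minimal sketch of the one-to-many inner join that keeps columns from both sides (the on="Number" form keeps a single copy of the key; this assumes Number was inferred with the same type in both files):
#Join on the shared key and pick columns from both frames;
#each df2 row with a matching Number produces one output row
dfAll = df1.join(df2, on="Number", how="inner") \
           .select("Number", df1.Type, df1.ID, df1.Code, df1.abc,
                   df2.DateTime, df2.Year, df2.Quarter, df2.xyz)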
Also, when merging the part files into a single file, only the last part file gets copied into the final data. The code is below:
dfAll.write.format("json").save("/mnt/coi/DataModel")
#Read part files
path = glob.glob("/dbfs/mnt/coi/DataModel/part-000*.json")
#Move files to the FinalData folder in blob storage
for file in path:
    shutil.move(file, "/dbfs/mnt/coi/FinalData/FinalData.json")
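The loop above overwrites /dbfs/mnt/coi/FinalData/FinalData.json on every iteration, which is why only the last part file survives. One way to sidestep the merge entirely is to coalesce to a single partition before writing (a sketch; this assumes the result is small enough to fit in one task, otherwise the single partition becomes a bottleneck):
#Write a single part file, then move it to the final name
dfAll.coalesce(1).write.format("json").mode("overwrite").save("/mnt/coi/DataModel")
part = glob.glob("/dbfs/mnt/coi/DataModel/part-*.json")[0]
shutil.move(part, "/dbfs/mnt/coi/FinalData/FinalData.json")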
Answer 0 (score: 1)
To get the result you expect, and assuming you only want the Number values that occur more than once (the "many" side of the one-to-many relationship), my approach would be:
from pyspark.sql.functions import col
#Inner join, keeping only the df2-side columns
df = df2.join(df1, df2.Number == df1.Number, how="inner").select(df2.DateTime, df2.Number, df2.Quarter, df2.Year, df2.abc, df2.xyz)
#Keep only the Numbers that occur more than once (the one-to-many cases)
df3 = df.groupBy("Number").count().filter(col("count") > 1).select("Number")
#Join back to recover the full rows for those Numbers
df4 = df3.join(df, "Number", "inner")
display(df4)
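As an alternative sketch, the same "Numbers that occur more than once" filter can be done with a window count instead of the extra self-join:
from pyspark.sql import Window
from pyspark.sql.functions import count

#Count rows per Number and keep only the duplicated ones
w = Window.partitionBy("Number")
df4 = df.withColumn("n", count("*").over(w)).filter(col("n") > 1).drop("n")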
Let me know if this helps.