I am testing with Java: I download data from an API and compare it against data in MongoDB. The downloaded JSON has 15-20 fields, while the database documents have about 300 fields.
My task is to compare the downloaded JSON with the MongoDB data and, for any field whose value differs, capture the past (stored) value.
Data downloaded from the API:
StudentId,Name,Phone,Email
1,tony,123,a@g.com
2,stark,456,b@g.com
3,spidy,789,c@g.com
MongoDB data:
StudentId,Name,Phone,Email,State,City
1,tony,1234,a@g.com,NY,Nowhere
2,stark,456,bg@g.com,NY,Nowhere
3,spidy,789,c@g.com,OH,Nowhere
Because the column sets differ in length, I can't use except.
Expected output:
StudentId,Name,Phone,Email,Past_Phone,Past_Email
1,tony,1234,a@g.com,1234, //phone number only changed
2,stark,456,b@g.com,,bg@g.com //Email only changed
3,spidy,789,c@g.com,,
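For reference, the per-record comparison described by the expected output can be sketched in plain Java with no Spark involved, assuming each record is a `Map<String, String>` keyed by field name; `FieldDiff` and `pastFields` are illustrative names, not from any library:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldDiff {
    // Compare an API record against its stored counterpart and emit
    // "Past_<field>" entries for every shared field whose value changed.
    static Map<String, String> pastFields(Map<String, String> api,
                                          Map<String, String> stored) {
        Map<String, String> past = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : api.entrySet()) {
            String old = stored.get(e.getKey());
            // Fields present only in the database (State, City, ...) are ignored.
            if (old != null && !old.equals(e.getValue())) {
                past.put("Past_" + e.getKey(), old);
            }
        }
        return past;
    }

    public static void main(String[] args) {
        Map<String, String> api = new LinkedHashMap<>();
        api.put("StudentId", "1");
        api.put("Name", "tony");
        api.put("Phone", "123");
        api.put("Email", "a@g.com");

        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("StudentId", "1");
        stored.put("Name", "tony");
        stored.put("Phone", "1234");
        stored.put("Email", "a@g.com");
        stored.put("State", "NY");
        stored.put("City", "Nowhere");

        System.out.println(pastFields(api, stored)); // {Past_Phone=1234}
    }
}
```

Only fields the API actually delivers are compared, so the 300-column database schema never needs to be enumerated.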
Answer 0 (score: 1)
Suppose your data sits in two DataFrames. We can create temporary views for them as follows:
api_df.createOrReplaceTempView("api_data")
mongo_df.createOrReplaceTempView("mongo_data")
Next, we can use Spark SQL. Here we join the two views on the StudentId column, then use case expressions on top of them to compute the past phone number and email:
spark.sql("""
select a.*
, case when a.Phone = b.Phone then '' else b.Phone end as Past_phone
, case when a.Email = b.Email then '' else b.Email end as Past_Email
from api_data a
join mongo_data b
on a.StudentId = b.StudentId
order by a.StudentId""").show()
Output:
+---------+-----+-----+-------+----------+----------+
|StudentId| Name|Phone| Email|Past_phone|Past_Email|
+---------+-----+-----+-------+----------+----------+
| 1| tony| 123|a@g.com| 1234| |
| 2|stark| 456|b@g.com| | bg@g.com|
| 3|spidy| 789|c@g.com| | |
+---------+-----+-----+-------+----------+----------+
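The same join-then-compare pattern can be mirrored in plain Java for readers not running Spark; a minimal sketch with hard-coded sample rows (`JoinDiff` and `diffRow` are illustrative names, not part of the answer's code):

```java
import java.util.List;
import java.util.Map;

public class JoinDiff {
    // Join one API row with its MongoDB row (already matched on StudentId)
    // and append Past_Phone / Past_Email columns, blank when unchanged --
    // the Java equivalent of the SQL case expressions above.
    static String diffRow(String[] api, String[] mongo) {
        String pastPhone = api[2].equals(mongo[2]) ? "" : mongo[2];
        String pastEmail = api[3].equals(mongo[3]) ? "" : mongo[3];
        return String.join(",", api[0], api[1], api[2], api[3], pastPhone, pastEmail);
    }

    public static void main(String[] args) {
        // API rows: StudentId, Name, Phone, Email
        List<String[]> api = List.of(
            new String[]{"1", "tony", "123", "a@g.com"},
            new String[]{"2", "stark", "456", "b@g.com"},
            new String[]{"3", "spidy", "789", "c@g.com"});
        // Mongo rows keyed by StudentId: StudentId, Name, Phone, Email, State, City
        Map<String, String[]> mongo = Map.of(
            "1", new String[]{"1", "tony", "1234", "a@g.com", "NY", "Nowhere"},
            "2", new String[]{"2", "stark", "456", "bg@g.com", "NY", "Nowhere"},
            "3", new String[]{"3", "spidy", "789", "c@g.com", "OH", "Nowhere"});

        for (String[] a : api) {
            System.out.println(diffRow(a, mongo.get(a[0]))); // join on StudentId
        }
    }
}
```

This prints `1,tony,123,a@g.com,1234,` for the first row, matching the Spark SQL output above.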
Answer 1 (score: 0)
Please find the source code for the same below. Here I have taken the unique-phone-number condition as an example.
write.csv2(Nits_hotel, "Nits_hotel.csv")
Answer 2 (score: 0)
We have:
df1.show
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a@g.com|
| 2| stark| 456|b@g.com|
| 3| spidy| 789|c@g.com|
+-----------+------+-------+-------+
df2.show
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a@g.com| NY|Nowhere|
| 2| stark| 456|bg@g.com| NY|Nowhere|
| 3| spidy| 789| c@g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
After the join:
var jn = df2.join(df1,df1("StudentId_1")===df2("StudentId_2"))
Then:
var ans = jn
  .withColumn("Past_Phone",
    when(jn("Phone_2").notEqual(jn("Phone_1")), jn("Phone_1")).otherwise(""))
  .withColumn("Past_Email",
    when(jn("Email_2").notEqual(jn("Email_1")), jn("Email_1")).otherwise(""))
Reference: Spark: Add column to dataframe conditionally
Next:
ans.select(
  ans("StudentId_2") as "StudentId",
  ans("Name_2") as "Name",
  ans("Phone_2") as "Phone",
  ans("Email_2") as "Email",
  ans("Past_Email"),
  ans("Past_Phone")).show
+---------+-----+-----+--------+----------+----------+
|StudentId| Name|Phone| Email|Past_Email|Past_Phone|
+---------+-----+-----+--------+----------+----------+
| 1| tony| 1234| a@g.com| | 123|
| 2|stark| 456|bg@g.com| b@g.com| |
| 3|spidy| 789| c@g.com| | |
+---------+-----+-----+--------+----------+----------+
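Since the real documents have around 300 fields, hand-writing one `withColumn` per field does not scale. A hedged plain-Java sketch of doing the same comparison generically, driven by the two header lists rather than hard-coded field names (`GenericDiff` and `pastValues` are illustrative names):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenericDiff {
    // For every column the API shares with the DB (except the join key),
    // emit a Past_<col> value: the stored value if it changed, else "".
    static List<String> pastValues(String[] apiHeader, String[] apiRow,
                                   String[] dbHeader, String[] dbRow) {
        Map<String, String> db = new LinkedHashMap<>();
        for (int i = 0; i < dbHeader.length; i++) {
            db.put(dbHeader[i], dbRow[i]);
        }
        List<String> past = new ArrayList<>();
        for (int i = 0; i < apiHeader.length; i++) {
            if (apiHeader[i].equals("StudentId")) continue; // skip the key
            String stored = db.get(apiHeader[i]);
            past.add(apiRow[i].equals(stored) ? "" : stored);
        }
        return past;
    }

    public static void main(String[] args) {
        String[] apiHeader = {"StudentId", "Name", "Phone", "Email"};
        String[] dbHeader  = {"StudentId", "Name", "Phone", "Email", "State", "City"};
        // stark: only the email differs between API and DB
        System.out.println(pastValues(apiHeader,
            new String[]{"2", "stark", "456", "b@g.com"},
            dbHeader,
            new String[]{"2", "stark", "456", "bg@g.com", "NY", "Nowhere"}));
        // [, , bg@g.com]  -> Past_Name blank, Past_Phone blank, Past_Email set
    }
}
```

Adding a column to either side only changes the header arrays, not the comparison logic, so the 300-field schema stays out of the code.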