How do I handle this situation?

Time: 2019-12-24 08:21:24

Tags: java apache-spark apache-spark-sql

I am using spark-sql 2.4.1 with Java 1.8. I have two datasets.

Dataset<Company> firstDataset = //get/read data from oracle company table.


Dataset<CompanyTransaction> secondDataset = //get/read data from oracle company_transaction table.

Company has columns such as "companyId", "companyName", "companyRegion", "column4", "column5", etc.

CompanyTransaction has columns such as "companyId", "transactionId", "transactionType", "column4", "column5", etc.

For each companyId in firstDataset, I need to fetch the rows with the matching companyId from CompanyTransaction.

How can I do this with Spark?

1 answer:

Answer 0 (score: 3)

Join the two datasets on companyId, then select all columns from the second dataset. The code should look something like this (untested):

// Inner join on companyId, keeping only the columns of secondDataset.
Dataset<Row> finalDf = firstDataset
    .join(secondDataset,
          firstDataset.col("companyId").equalTo(secondDataset.col("companyId")),
          "inner")
    .select(secondDataset.col("*"));
finalDf.show();
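
A possible variant, offered as a sketch rather than a tested solution: since only the transaction columns are wanted, a "left_semi" join with the transactions on the left expresses the same lookup without the extra select. It also avoids repeating a transaction row if a companyId happens to appear more than once in the company dataset. The class name CompanyTransactionLookup, the helper transactionsFor, the sample rows, and the local master setting are all made up for illustration. The example assumes spark-sql 2.4.x is on the classpath and uses small in-memory data in place of the Oracle tables.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CompanyTransactionLookup {

    // Hypothetical helper: keeps each transaction row whose companyId also
    // appears in companies. A left_semi join returns only the left side's
    // columns and never duplicates a left row.
    static Dataset<Row> transactionsFor(Dataset<?> companies, Dataset<?> transactions) {
        return transactions.join(
                companies,
                transactions.col("companyId").equalTo(companies.col("companyId")),
                "left_semi");
    }

    public static void main(String[] args) {
        // Local session for illustration only; a real job would reuse its own session.
        SparkSession spark = SparkSession.builder()
                .appName("company-transaction-lookup")
                .master("local[*]")
                .getOrCreate();

        // Assumed, simplified schemas standing in for the Oracle tables.
        StructType companySchema = new StructType()
                .add("companyId", DataTypes.IntegerType)
                .add("companyName", DataTypes.StringType);
        StructType transactionSchema = new StructType()
                .add("companyId", DataTypes.IntegerType)
                .add("transactionId", DataTypes.IntegerType)
                .add("transactionType", DataTypes.StringType);

        // Made-up sample rows.
        List<Row> companyRows = Arrays.asList(
                RowFactory.create(1, "Acme"),
                RowFactory.create(2, "Globex"));
        List<Row> transactionRows = Arrays.asList(
                RowFactory.create(1, 100, "BUY"),
                RowFactory.create(1, 101, "SELL"),
                RowFactory.create(3, 102, "BUY"));

        Dataset<Row> companies = spark.createDataFrame(companyRows, companySchema);
        Dataset<Row> transactions = spark.createDataFrame(transactionRows, transactionSchema);

        // Transactions 100 and 101 match companyId 1; transaction 102 has no matching company.
        transactionsFor(companies, transactions).show();

        spark.stop();
    }
}
```

With the datasets from the question, the call would be transactionsFor(firstDataset, secondDataset). The result is a Dataset<Row>; converting it back to Dataset<CompanyTransaction> would need an encoder, for example Encoders.bean(CompanyTransaction.class), assuming CompanyTransaction is a Java bean.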