I am using spark-sql-2.4.1v with Java 1.8. I have two datasets:
Dataset<Company> firstDataset = //get/read data from oracle company table.
Dataset<CompanyTransaction> secondDataset = //get/read data from oracle company_transaction table.
Company has columns like "companyId", "companyName", "companyRegion", "column4", "column5", etc.
CompanyTransaction has columns like "companyId", "transactionId", "transactionType", "column4", "column5", etc.
For each companyId in firstDataset, I need to fetch the corresponding rows from CompanyTransaction.
How can I do this with Spark?
Answer 0 (score: 3)
Join the two datasets on companyId, then select all columns from the second dataset. The code should look like this (untested):
Dataset<Row> finalDf = firstDataset.join(secondDataset,
        firstDataset.col("companyId").equalTo(secondDataset.col("companyId")),
        "inner")
    .select(secondDataset.col("*"));
finalDf.show();
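For reference, here is a self-contained sketch of the same inner join that can be run in local mode. The sample rows, schemas, and class name are hypothetical stand-ins for the Oracle tables; only the column names from the question are assumed. It also shows the `join(right, usingColumn)` overload, which keeps a single companyId column in the result instead of two:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CompanyJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("company-join")
                .master("local[*]") // local mode, for this sketch only
                .getOrCreate();

        // Hypothetical stand-ins for the Oracle tables; only the
        // join-relevant columns are included.
        StructType companySchema = new StructType()
                .add("companyId", DataTypes.LongType)
                .add("companyName", DataTypes.StringType);
        Dataset<Row> companies = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1L, "Acme"),
                RowFactory.create(2L, "Globex")), companySchema);

        StructType txSchema = new StructType()
                .add("companyId", DataTypes.LongType)
                .add("transactionId", DataTypes.StringType);
        Dataset<Row> transactions = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1L, "t1"),
                RowFactory.create(1L, "t2"),
                RowFactory.create(3L, "t3")), txSchema);

        // Inner join on the shared column name; unmatched companyIds
        // (2 and 3 here) are dropped from the result.
        Dataset<Row> joined = companies.join(transactions, "companyId");
        joined.show();

        spark.stop();
    }
}
```

Note that this requires the spark-sql 2.4.x dependency on the classpath; the join-by-column-name form avoids the ambiguous duplicate companyId column you get with the explicit `equalTo` condition.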