I have two datasets that I want to join, but I only want to keep the data from the first dataset.
I want to select the rows whose account exists in both DS1 and DS2, but display only the first dataset's columns (account and amount1).
My datasets look like this:
DS1
+---------+------------+
| account | amount1    |
+---------+------------+
| aaaaaa  | 1000       |
| bbbbbb  | 4000       |
| cccccc  | 5000       |
| cccccc  | 5000       |
+---------+------------+
DS2
+---------+------------+------------+
| account | amount2    | amount3    |
+---------+------------+------------+
| bbbbbb  | 4000       | 4000       |
| cccccc  | 5000       | 5000       |
+---------+------------+------------+
I want to end up with this dataset:
+---------+------------+
| account | amount1    |
+---------+------------+
| aaaaaa  | 1000       |
| cccccc  | 5000       |
| cccccc  | 5000       |
+---------+------------+
Could someone show me a sample expression in the Spark Java API that does this? Thanks in advance.
Answer 0 (score: 0)
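Build the two datasets, register them as temporary views, and join them with SQL. The inner join below keeps only the accounts that appear in both tables and selects just the columns you ask for: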
val ds1 = Seq(
  ("aaaaaa", "1000"),
  ("bbbbbb", "4000"),
  ("cccccc", "5000"),
  ("cccccc", "5000")
).toDF("account", "amount1")
ds1.show()
+-------+-------+
|account|amount1|
+-------+-------+
| aaaaaa|   1000|
| bbbbbb|   4000|
| cccccc|   5000|
| cccccc|   5000|
+-------+-------+
val ds2 = Seq(
  ("bbbbbb", "4000", "4000"),
  ("cccccc", "5000", "5000")
).toDF("account", "amount2", "amount3")
ds2.show()
+-------+-------+-------+
|account|amount2|amount3|
+-------+-------+-------+
| bbbbbb|   4000|   4000|
| cccccc|   5000|   5000|
+-------+-------+-------+
ds1.createOrReplaceTempView("table_1")
ds2.createOrReplaceTempView("table_2")
// Inner join: keep only the accounts present in both tables,
// selecting just the columns of table_1.
spark.sql("SELECT table_1.account, table_1.amount1 FROM table_1 INNER JOIN table_2 ON table_1.account = table_2.account ORDER BY table_1.account").show()
+-------+-------+
|account|amount1|
+-------+-------+
| bbbbbb|   4000|
| cccccc|   5000|
| cccccc|   5000|
+-------+-------+
sqlContext.sql("SELECT table_1.account,table_2.amount2,table_2.amount3 FROM table_1 INNER JOIN table_2 ON table_1.account = table_2.account order by table_1.account").show
+-------+-------+-------+
|account|amount2|amount3|
+-------+-------+-------+
| bbbbbb|   4000|   4000|
| cccccc|   5000|   5000|
| cccccc|   5000|   5000|
+-------+-------+-------+
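As an alternative to the SQL inner join, the Dataset API has a semi join that returns only the left side's columns, which matches "only get the first dataset's data" directly. Below is a minimal sketch, assuming a spark-shell style session (the SparkSession setup is included only so the snippet stands alone; in spark-shell, spark and the implicits already exist):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val ds1 = Seq(
  ("aaaaaa", "1000"),
  ("bbbbbb", "4000"),
  ("cccccc", "5000"),
  ("cccccc", "5000")
).toDF("account", "amount1")

val ds2 = Seq(
  ("bbbbbb", "4000", "4000"),
  ("cccccc", "5000", "5000")
).toDF("account", "amount2", "amount3")

// "left_semi" keeps each row of ds1 whose account also appears in ds2
// and returns only ds1's columns; nothing from ds2 enters the result.
ds1.join(ds2, Seq("account"), "left_semi")
  .orderBy("account")
  .show()
+-------+-------+
|account|amount1|
+-------+-------+
| bbbbbb|   4000|
| cccccc|   5000|
| cccccc|   5000|
+-------+-------+

Unlike an inner join, a semi join never duplicates ds1 rows when the same account occurs more than once in ds2. Since the question asked for the Java API: the equivalent call there is ds1.join(ds2, ds1.col("account").equalTo(ds2.col("account")), "left_semi").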