Spark Java API: joining two Datasets

Date: 2018-07-11 13:00:53

Tags: java apache-spark apache-spark-sql hdfs

I have two Datasets that I want to join, but I only want to keep the data from the first one.

I want to select the rows whose account exists in both ds1 and ds2, but display only (account and amount1).

My datasets look like this:

DS1

+---------+------------+
|  account|    amount1 |
+---------+------------+
| aaaaaa  |   1000     |
| bbbbbb  |   4000     |
| cccccc  |   5000     |
| cccccc  |   5000     |
+---------+------------+

DS2

+---------+------------+------------+
|  account|    amount2 |    amount3 |
+---------+------------+------------+
| bbbbbb  |   4000     |   4000     |
| cccccc  |   5000     |   5000     |
+---------+------------+------------+

This is the dataset I want to get:

+---------+------------+
|  account|    amount1 |
+---------+------------+
| aaaaaa  |   1000     |
| cccccc  |   5000     |
| cccccc  |   5000     |
+---------+------------+

Can someone show me a sample expression for doing this with the Spark Java API? Thanks in advance.

1 Answer:

Answer 0 (score: 0)

    val ds1 = Seq(
      ("aaaaaa", "1000"),
      ("bbbbbb", "4000"),
      ("cccccc", "5000"),
      ("cccccc", "5000")
    ).toDF("account", "amount1")

    ds1.show()



    +-------+-------+
    |account|amount1|
    +-------+-------+
    | aaaaaa|   1000|
    | bbbbbb|   4000|
    | cccccc|   5000|
    | cccccc|   5000|
    +-------+-------+



    val ds2 = Seq(
      ("bbbbbb", "4000", "4000"),
      ("cccccc", "5000", "5000")
    ).toDF("account", "amount2", "amount3")

    ds2.show()



    +-------+-------+-------+
    |account|amount2|amount3|
    +-------+-------+-------+
    | bbbbbb|   4000|   4000|
    | cccccc|   5000|   5000|
    +-------+-------+-------+


    // Register both DataFrames as temporary views
    ds1.createOrReplaceTempView("table_1")
    ds2.createOrReplaceTempView("table_2")

    // Only needed if you want a cross join instead:
    // spark.conf.set("spark.sql.crossJoin.enabled", "true")

    // Inner join, keeping only the columns of table_1
    sqlContext.sql("SELECT table_1.account, table_1.amount1 FROM table_1 INNER JOIN table_2 ON table_1.account = table_2.account ORDER BY table_1.account").show

    +-------+-------+
    |account|amount1|
    +-------+-------+
    | bbbbbb|   4000|
    | cccccc|   5000|
    | cccccc|   5000|
    +-------+-------+


sqlContext.sql("SELECT table_1.account,table_2.amount2,table_2.amount3 FROM table_1 INNER JOIN table_2 ON table_1.account = table_2.account order by table_1.account").show


    +-------+-------+-------+
    |account|amount2|amount3|
    +-------+-------+-------+
    | bbbbbb|   4000|   4000|
    | cccccc|   5000|   5000|
    | cccccc|   5000|   5000|
    +-------+-------+-------+





    ds1: org.apache.spark.sql.DataFrame = [account: string, amount1: string]
    ds2: org.apache.spark.sql.DataFrame = [account: string, amount2: string ... 1 more field]
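
The snippets above are Scala spark-shell code. Since the question asks for the Spark Java API, here is a minimal, self-contained sketch of the same inner join written against the Java Dataset API; the class name JoinExample, the local[*] master, and the inlined sample rows are illustrative assumptions only.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class JoinExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("JoinExample")    // illustrative app name
                    .master("local[*]")        // illustrative master, for local testing
                    .getOrCreate();

            // Build ds1: (account, amount1)
            StructType schema1 = new StructType()
                    .add("account", DataTypes.StringType)
                    .add("amount1", DataTypes.StringType);
            List<Row> rows1 = Arrays.asList(
                    RowFactory.create("aaaaaa", "1000"),
                    RowFactory.create("bbbbbb", "4000"),
                    RowFactory.create("cccccc", "5000"),
                    RowFactory.create("cccccc", "5000"));
            Dataset<Row> ds1 = spark.createDataFrame(rows1, schema1);

            // Build ds2: (account, amount2, amount3)
            StructType schema2 = new StructType()
                    .add("account", DataTypes.StringType)
                    .add("amount2", DataTypes.StringType)
                    .add("amount3", DataTypes.StringType);
            List<Row> rows2 = Arrays.asList(
                    RowFactory.create("bbbbbb", "4000", "4000"),
                    RowFactory.create("cccccc", "5000", "5000"));
            Dataset<Row> ds2 = spark.createDataFrame(rows2, schema2);

            // Inner join on "account", then keep only the columns of ds1
            Dataset<Row> result = ds1.join(ds2, "account")
                    .select("account", "amount1")
                    .orderBy("account");

            result.show();

            spark.stop();
        }
    }

As in the SQL version, result.show() should print bbbbbb once and cccccc twice, because the inner join on account keeps every matching row from ds1.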