SparkSQL: inner join returns a cross join

Time: 2018-10-14 04:21:04

Tags: java apache-spark-sql inner-join

Fully edited

I want to do an inner join on two datasets. The first one (metlistafter) is:

+------------------------------------+-------------+---------+
|                        experimentid|  description|intensity|
+------------------------------------+-------------+---------+
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  77.4946|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  14.6063|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|   30.593|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  76.2769|
+------------------------------------+-------------+---------+

The second one (metlistbeforetemp) is:

+------------------------------------+-------------+---------+
|                        experimentid|  description|intensity|
+------------------------------------+-------------+---------+
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  124.379|
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  175.656|
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  184.736|
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  58.0333|
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  205.899|
+------------------------------------+-------------+---------+

This is what I want (4 rows in total):

+------------------------------------+-------------+---------+-------------+---------+
|                        experimentid|  description|intensity|  description|intensity|
+------------------------------------+-------------+---------+-------------+---------+
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  77.4946|1_3-butadiene|  124.379|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  14.6063|1_3-butadiene|  175.656|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|   30.593|1_3-butadiene|  184.736|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  76.2769|1_3-butadiene|  58.0333|
+------------------------------------+-------------+---------+-------------+---------+

Instead, I get what looks like a cross join result (4 x 5 = 20 rows in total):

+------------------------------------+-------------+---------+------------------------------------+-------------+---------+
|                        experimentid|  description|intensity|                        experimentid|  description|intensity|
+------------------------------------+-------------+---------+------------------------------------+-------------+---------+
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  124.379|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  175.656|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  184.736|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  58.0333|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid|  77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene|  205.899|
...(more rows)

My full code is:

import java.util.HashMap;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

Dataset<Row> metlistafter = sp.emptyDataFrame();
Dataset<Row> metlistinitial = sp.read().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "mdb");
            put("table", "experiment");
        }
    })
    .load()
    .select(col("experimentid"), col("description"), col("intensity"))
    .filter(col("experimentid").isin(experimentlist.toArray()))
    .filter(col("description").isin(metabolitelist.toArray()));

for (int iexp = 0; iexp < experimentlist.size(); iexp++) {
    for (int imet = 0; imet < metabolitelist.size(); imet++) {
        // Rows for the current experiment and the current metabolite only.
        Dataset<Row> metlistbeforetemp = metlistinitial
            .select(col("experimentid").alias("experimentid"), col("description").alias("description"), col("intensity").alias("intensity"))
            .filter(col("experimentid").isin(experimentlist.get(iexp)))
            .filter(col("description").isin(metabolitelist.get(imet)));
        if (imet == 0) {
            metlistafter = metlistbeforetemp.select(col("experimentid"), col("description"), col("intensity"));
        } else {
            // The join condition compares experimentid only.
            metlistafter = metlistafter.join(
                metlistbeforetemp,
                metlistafter.col("experimentid").equalTo(metlistbeforetemp.col("experimentid")),
                "inner");
            // .where(metlistafter.col("experimentid").equalTo(metlistbeforetemp.col("experimentid")));
            System.out.println("result " + metlistafter.count());
            metlistafter.show();
        }
    }
}
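
A likely explanation for the 20 rows, sketched here in plain Java rather than Spark: every row in both datasets carries the same experimentid, so a join condition on experimentid alone is true for every pair of rows, and the inner join then returns the same 4 x 5 = 20 rows a cross join would. The class and method names below (EquiJoinRowCount, Row, rows, innerJoinOnExperimentId) are hypothetical and only model the join semantics; they are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of an inner equi-join; this is not Spark code.
class EquiJoinRowCount {

    // Simplified row holding only the columns that matter for the row count.
    static final class Row {
        final String experimentId;
        final double intensity;

        Row(String experimentId, double intensity) {
            this.experimentId = experimentId;
            this.intensity = intensity;
        }
    }

    // Builds one row per intensity, all sharing the same experiment id.
    static List<Row> rows(String experimentId, double... intensities) {
        List<Row> out = new ArrayList<>();
        for (double intensity : intensities) {
            out.add(new Row(experimentId, intensity));
        }
        return out;
    }

    // Inner join on experimentId: keeps every (left, right) pair with equal keys.
    static List<Row[]> innerJoinOnExperimentId(List<Row> left, List<Row> right) {
        List<Row[]> out = new ArrayList<>();
        for (Row l : left) {
            for (Row r : right) {
                if (l.experimentId.equals(r.experimentId)) {
                    out.add(new Row[] {l, r});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String id = "231d4040-d486-4e8b-937a-aebe645ef8ae";
        List<Row> after = rows(id, 77.4946, 14.6063, 30.593, 76.2769);
        List<Row> before = rows(id, 124.379, 175.656, 184.736, 58.0333, 205.899);
        // All keys are equal, so all 4 x 5 pairs survive the join.
        System.out.println(innerJoinOnExperimentId(after, before).size()); // prints 20
    }
}
```

With a key that is identical on every row, the join has no way to tell which left row belongs to which right row, so some additional per-row key would be needed to get 4 rows back.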

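So, basically, I want to successively append the columns of the metlistbeforetemp dataset to the metlistafter dataset. For example, if metlistbeforetemp.count() is 25 with 3 columns, and metlistafter.count() is 22, also with 3 columns, I want to combine the two and assign the result to metlistafter, so that metlistafter ends up with 5 columns and metlistafter.count() = 22. Sorry for the confusion; I am really trying to explain this as best I can!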
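
One possible reading of this requirement, sketched under the assumption that rows should be matched purely by their position within each dataset: give every row an explicit index and join on that index together with experimentid. In Spark such an index could be produced with row_number() over a window or with zipWithIndex() on the underlying RDD, but a dataset has no guaranteed row order unless one is imposed explicitly, so the ordering column would have to be chosen deliberately. The plain-Java snippet below only models the pairing logic; PositionalJoin, Row, rows and joinByPosition are hypothetical names, not Spark API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of joining on (experimentid, row index); this is not Spark code.
class PositionalJoin {

    static final class Row {
        final String experimentId;
        final String description;
        final double intensity;

        Row(String experimentId, String description, double intensity) {
            this.experimentId = experimentId;
            this.description = description;
            this.intensity = intensity;
        }
    }

    // Builds one row per intensity; list position plays the role of the row index.
    static List<Row> rows(String experimentId, String description, double... intensities) {
        List<Row> out = new ArrayList<>();
        for (double intensity : intensities) {
            out.add(new Row(experimentId, description, intensity));
        }
        return out;
    }

    // Pairs the i-th left row with the i-th right row when their experiment ids
    // match. Rows beyond the end of the shorter list have no partner and are dropped.
    static List<Row[]> joinByPosition(List<Row> left, List<Row> right) {
        List<Row[]> out = new ArrayList<>();
        int n = Math.min(left.size(), right.size());
        for (int i = 0; i < n; i++) {
            Row l = left.get(i);
            Row r = right.get(i);
            if (l.experimentId.equals(r.experimentId)) {
                out.add(new Row[] {l, r});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String id = "231d4040-d486-4e8b-937a-aebe645ef8ae";
        List<Row> after = rows(id, "Palmitic acid", 77.4946, 14.6063, 30.593, 76.2769);
        List<Row> before = rows(id, "1_3-butadiene", 124.379, 175.656, 184.736, 58.0333, 205.899);
        List<Row[]> paired = joinByPosition(after, before);
        System.out.println(paired.size()); // prints 4
        for (Row[] p : paired) {
            System.out.println(p[0].experimentId + " | " + p[0].description + " | " + p[0].intensity
                + " | " + p[1].description + " | " + p[1].intensity);
        }
    }
}
```

In this model the fifth 1_3-butadiene row (205.899) is dropped because it has no partner, which matches the 4-row result described above and the 22-row outcome in the 25/22 example.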

0 answers:

No answers