Edited completely
So, I want to do an inner join on two datasets. The first one (metlistafter) is:
+------------------------------------+-------------+---------+
| experimentid| description|intensity|
+------------------------------------+-------------+---------+
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 77.4946|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 14.6063|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 30.593|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 76.2769|
+------------------------------------+-------------+---------+
The second one (metlistbeforetemp) is:
+------------------------------------+-------------+---------+
| experimentid| description|intensity|
+------------------------------------+-------------+---------+
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 124.379|
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 175.656|
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 184.736|
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 58.0333|
|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 205.899|
+------------------------------------+-------------+---------+
This is what I want (4 rows in total):
+------------------------------------+-------------+---------+-------------+---------+
| experimentid| description|intensity| description|intensity|
+------------------------------------+-------------+---------+-------------+---------+
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 77.4946|1_3-butadiene| 124.379|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 14.6063|1_3-butadiene| 175.656|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 30.593|1_3-butadiene| 184.736|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 76.2769|1_3-butadiene| 58.0333|
+------------------------------------+-------------+---------+-------------+---------+
However, what I get is a cross-join result (4 x 5 = 20 rows in total):
+------------------------------------+-------------+---------+------------------------------------+-------------+---------+
| experimentid| description|intensity| experimentid| description|intensity|
+------------------------------------+-------------+---------+------------------------------------+-------------+---------+
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 124.379|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 175.656|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 184.736|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 58.0333|
|231d4040-d486-4e8b-937a-aebe645ef8ae|Palmitic acid| 77.4946|231d4040-d486-4e8b-937a-aebe645ef8ae|1_3-butadiene| 205.899|
...(more rows)
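As far as I can tell, 20 rows is what an inner equi-join produces when every row on both sides carries the same experimentid: each of the 4 left rows matches all 5 right rows. The plain-Java sketch below (not Spark code; the class and method names are made up for illustration) reproduces that count with the sample data above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: rows are String[] {experimentid, description, intensity}.
class EquiJoinDemo {
    static final String ID = "231d4040-d486-4e8b-937a-aebe645ef8ae";

    static List<String[]> sampleAfter() {
        return Arrays.asList(
                new String[] {ID, "Palmitic acid", "77.4946"},
                new String[] {ID, "Palmitic acid", "14.6063"},
                new String[] {ID, "Palmitic acid", "30.593"},
                new String[] {ID, "Palmitic acid", "76.2769"});
    }

    static List<String[]> sampleBefore() {
        return Arrays.asList(
                new String[] {ID, "1_3-butadiene", "124.379"},
                new String[] {ID, "1_3-butadiene", "175.656"},
                new String[] {ID, "1_3-butadiene", "184.736"},
                new String[] {ID, "1_3-butadiene", "58.0333"},
                new String[] {ID, "1_3-butadiene", "205.899"});
    }

    // Inner equi-join on column 0: a left row is paired with EVERY right row that has the same key.
    static List<String[]> innerJoin(List<String[]> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : right) {
                if (l[0].equals(r[0])) {
                    out.add(new String[] {l[0], l[1], l[2], r[0], r[1], r[2]});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // All keys are identical, so the result has 4 * 5 rows.
        System.out.println(innerJoin(sampleAfter(), sampleBefore()).size()); // prints 20
    }
}
```

So the join condition on experimentid alone does not say which row on the left belongs to which row on the right.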
My full code is:
// Placeholder; overwritten on the first inner-loop iteration (imet == 0).
Dataset<Row> metlistafter = sp.emptyDataFrame();

// Load the experiment table from Cassandra, keeping only the requested experiments and metabolites.
Dataset<Row> metlistinitial = sp.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "mdb");
                put("table", "experiment");
            }
        })
        .load()
        .select(col("experimentid"), col("description"), col("intensity"))
        .filter(col("experimentid").isin(experimentlist.toArray()))
        .filter(col("description").isin(metabolitelist.toArray()));

for (int iexp = 0; iexp < experimentlist.size(); iexp++) {
    for (int imet = 0; imet < metabolitelist.size(); imet++) {
        // Rows for one (experiment, metabolite) pair.
        Dataset<Row> metlistbeforetemp = metlistinitial
                .select(col("experimentid").alias("experimentid"),
                        col("description").alias("description"),
                        col("intensity").alias("intensity"))
                .filter(col("experimentid").isin(experimentlist.get(iexp)))
                .filter(col("description").isin(metabolitelist.get(imet)));
        if (imet == 0) {
            metlistafter = metlistbeforetemp.select(col("experimentid"), col("description"), col("intensity"));
        } else {
            // Join on experimentid only; this is the step that returns 4 x 5 = 20 rows.
            metlistafter = metlistafter.join(metlistbeforetemp,
                    metlistafter.col("experimentid").equalTo(metlistbeforetemp.col("experimentid")),
                    "inner"); //.where(metlistafter.col("experimentid").equalTo(metlistbeforetemp.col("experimentid")));
            System.out.println("result " + metlistafter.count());
            metlistafter.show();
        }
    }
}
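So, basically, I want to successively add the columns of the metlistbeforetemp dataset to the metlistafter dataset. For example, if metlistbeforetemp.count() is 25 with 3 columns and metlistafter.count() is 22, also with 3 columns, I want to merge the two and assign the result to metlistafter, so that metlistafter ends up with 5 columns and metlistafter.count() = 22. Sorry for the confusion; I am really trying to explain this as best I can!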
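To make the goal concrete, here is a plain-Java sketch (again not Spark; the class and method names are hypothetical) of the row-by-row pairing I am after: row i on the left is matched with row i on the right, the shared experimentid is kept once, and the result stops at the shorter side. In Spark this would presumably need an explicit position column on both datasets (for example one built with `row_number()` over a window) that is joined on together with experimentid; that part is an assumption I have not verified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: rows are String[] {experimentid, description, intensity}.
class PositionalPairDemo {
    static final String ID = "231d4040-d486-4e8b-937a-aebe645ef8ae";

    static List<String[]> sampleAfter() {
        return Arrays.asList(
                new String[] {ID, "Palmitic acid", "77.4946"},
                new String[] {ID, "Palmitic acid", "14.6063"},
                new String[] {ID, "Palmitic acid", "30.593"},
                new String[] {ID, "Palmitic acid", "76.2769"});
    }

    static List<String[]> sampleBefore() {
        return Arrays.asList(
                new String[] {ID, "1_3-butadiene", "124.379"},
                new String[] {ID, "1_3-butadiene", "175.656"},
                new String[] {ID, "1_3-butadiene", "184.736"},
                new String[] {ID, "1_3-butadiene", "58.0333"},
                new String[] {ID, "1_3-butadiene", "205.899"});
    }

    // Pair rows by position and stop at the shorter list, so 4 and 5 input rows give 4 output rows.
    static List<String[]> zipByPosition(List<String[]> left, List<String[]> right) {
        int n = Math.min(left.size(), right.size());
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String[] l = left.get(i);
            String[] r = right.get(i);
            // experimentid once, then description and intensity from each side: 5 columns.
            out.add(new String[] {l[0], l[1], l[2], r[1], r[2]});
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] row : zipByPosition(sampleAfter(), sampleBefore())) {
            System.out.println(String.join("|", row));
        }
    }
}
```

With the sample data this gives the 4 rows and 5 columns shown in the desired output; the fifth 1_3-butadiene row (205.899) has no partner and is dropped.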