Combining DataFrames in Java

Date: 2016-11-07 04:38:00

Tags: java apache-spark

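How can I merge a DataFrame that has a single column (description) with another DataFrame that has two columns (name, caption), so that the resulting DataFrame contains three columns (name, caption, description)?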

1 answer:

Answer 0 (score: 0)

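I am providing a solution in Scala. I could have added this as a comment, but because of the formatting and the images I am attaching, I am posting it as an answer. I am fairly sure there must be an equivalent in Java.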

// Assumes spark-shell, which already imports the implicits that toDF needs.
import org.apache.spark.sql.functions.monotonically_increasing_id

val nameCaptionDataFrame = Seq(("name1","caption1"),("name2","caption2"),("name3","caption3"),("name4","caption4")).toDF("name","caption")
val descriptionDataFrame = List("desc1","desc2","desc3","desc4").toDF("description")
// Tag every row on both sides with a generated id, then join on that id.
val nameCaptionDataFrameWithId = nameCaptionDataFrame.withColumn("nameId", monotonically_increasing_id())
nameCaptionDataFrameWithId.show
val descriptionDataFrameId = descriptionDataFrame.withColumn("descId", monotonically_increasing_id())
descriptionDataFrameId.show
nameCaptionDataFrameWithId.join(descriptionDataFrameId, nameCaptionDataFrameWithId.col("nameId") === descriptionDataFrameId.col("descId")).show

Below is the sample output of this code. I hope you can take the idea from here (I believe the API is consistent) and do the same in Java.

[Image: sample output of the Scala code above]
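The idea behind the snippet does not depend on Spark: give every row on each side an id equal to its position, then keep the pairs of rows whose ids match. The following is a minimal plain-Java sketch of that idea, not Spark code; the class name `PositionalJoinSketch` and the method `joinOnRowId` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PositionalJoinSketch {
    /** Joins (name, caption) rows with descriptions on a position-based id. */
    static List<String[]> joinOnRowId(List<String[]> nameCaptions, List<String> descriptions) {
        // Plays the role of the added id column on the description side.
        Map<Long, String> descriptionById = new HashMap<>();
        for (int i = 0; i < descriptions.size(); i++) {
            descriptionById.put((long) i, descriptions.get(i));
        }
        // Inner join: a row is kept only if its id exists on both sides.
        List<String[]> joined = new ArrayList<>();
        for (int i = 0; i < nameCaptions.size(); i++) {
            String description = descriptionById.get((long) i);
            if (description != null) {
                String[] row = nameCaptions.get(i);
                joined.add(new String[]{row[0], row[1], description});
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> nameCaptions = Arrays.asList(
                new String[]{"name1", "caption1"},
                new String[]{"name2", "caption2"});
        List<String> descriptions = Arrays.asList("desc1", "desc2");
        for (String[] row : joinOnRowId(nameCaptions, descriptions)) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Because this is an inner join, rows without a partner on the other side are silently dropped, which is also how the Spark join on the id column behaves.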

**Edit for Java:** The "translated" code looks similar to this.

/**
 * Created by RGOVIND on 11/8/2016.
 */

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparkMain {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Stack Overflow App");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Sample (name, caption) rows.
        List<Tuple2<String, String>> tuples = new ArrayList<>();
        tuples.add(new Tuple2<>("name1", "caption1"));
        tuples.add(new Tuple2<>("name2", "caption2"));
        tuples.add(new Tuple2<>("name3", "caption3"));

        // Sample description rows, in the same order as the tuples above.
        List<String> descriptions = Arrays.asList("desc1", "desc2", "desc3");

        Encoder<Tuple2<String, String>> nameCaptionEncoder = Encoders.tuple(Encoders.STRING(), Encoders.STRING());
        Dataset<Tuple2<String, String>> nameValueDataSet = sqlContext.createDataset(tuples, nameCaptionEncoder);
        Dataset<String> descriptionDataSet = sqlContext.createDataset(descriptions, Encoders.STRING());

        // Add a generated id column to each side, then join on it.
        Dataset<Row> nameValueDataSetWithId = nameValueDataSet.toDF("name", "caption")
                .withColumn("id", functions.monotonically_increasing_id());
        // toDF("description") renames the default "value" column of a Dataset<String>.
        Dataset<Row> descriptionDataSetId = descriptionDataSet.toDF("description")
                .withColumn("id", functions.monotonically_increasing_id());
        nameValueDataSetWithId.join(descriptionDataSetId, "id").show();
    }
}

This prints the following. I hope this helps.

[Image: output of the Java program above]
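One caveat about this approach: the Spark API documentation only promises that `monotonically_increasing_id()` produces unique, increasing ids, not consecutive ones. It describes the current implementation as putting the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. The join therefore pairs rows as intended only when both sides are partitioned identically, as in the single-partition local example above. The plain-Java sketch below models that documented layout to show the effect; it is not Spark code, and the names `MonotonicIdSketch`, `monotonicId` and `idsFor` are hypothetical.

```java
import java.util.Arrays;

public class MonotonicIdSketch {
    /** Models the documented layout: partition index in the upper 31 bits, record number in the lower 33. */
    static long monotonicId(int partitionIndex, long recordNumber) {
        return ((long) partitionIndex << 33) + recordNumber;
    }

    /** Returns the ids of all rows when the data is split into partitions of the given sizes. */
    static long[] idsFor(int... partitionSizes) {
        int total = 0;
        for (int size : partitionSizes) {
            total += size;
        }
        long[] ids = new long[total];
        int next = 0;
        for (int partition = 0; partition < partitionSizes.length; partition++) {
            for (long record = 0; record < partitionSizes[partition]; record++) {
                ids[next++] = monotonicId(partition, record);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Three rows in a single partition: ids are consecutive.
        System.out.println(Arrays.toString(idsFor(3)));    // [0, 1, 2]
        // The same three rows split 2 + 1 over two partitions: the last id jumps.
        System.out.println(Arrays.toString(idsFor(2, 1))); // [0, 1, 8589934592]
    }
}
```

If one DataFrame had its three rows in one partition and the other had them split 2 + 1, only the first two rows would find a partner in the join. When the partitioning of the inputs cannot be controlled, an index built with the RDD method `zipWithIndex`, which assigns consecutive indices, is one alternative worth looking into.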