使用Java合并spark数据集中的两列

时间:2017-04-07 14:38:14

标签: java apache-spark-sql apache-spark-dataset

我想在apache spark数据集中合并两列。 我试过以下但是没有用,有人可以提出解决方案吗?

        Dataset<Row> df1 = spark.read().json("src/test/resources/DataSets/DataSet.json");
        Dataset<Row> df1Map = df1.select(functions.array("beginTime", "endTime"));
    df1Map.show();

输入DataSet.json如下:

{"IPsrc":"abc", "IPdst":"def", "beginTime": 1, "endTime":1}
{"IPsrc":"def", "IPdst":"abc", "beginTime": 2, "endTime":2}
{"IPsrc":"abc", "IPdst":"def", "beginTime": 3, "endTime":3}
{"IPsrc":"def", "IPdst":"abc", "beginTime": 4, "endTime":4}

请注意我在 JAVA 8

中这样做

以上导致错误如下:

com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'beginTime': was expecting ('true', 'false' or 'null')
at [Source: beginTime; line: 1, column: 19]

at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2462)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1621)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:689)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3776)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3721)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:20)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:50)
at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
at org.apache.spark.sql.types.DataType.fromJson(DataType.scala)

df1.show()的初始表输出如下:

+-----+-----+---------+-------+
|IPdst|IPsrc|beginTime|endTime|
+-----+-----+---------+-------+
|  def|  abc|        1|      1|
|  abc|  def|        2|      2|
|  def|  abc|        3|      3|
|  abc|  def|        4|      4|
+-----+-----+---------+-------+

df1的架构如下:

root
 |-- IPdst: string (nullable = true)
 |-- IPsrc: string (nullable = true)
 |-- beginTime: long (nullable = true)
 |-- endTime: long (nullable = true)

2 个答案:

答案 0 :(得分:0)

如果要连接两列,可以执行以下操作:

Dataset<Row> df1Map = df1.select(functions.concat(df1.col("beginTime"), df1.col("endTime")));
df1Map.show();

编辑: 我试过这个并且它有效: 这是我的代码:

SparkSession spark = SparkSession
            .builder()
            .master("local")
            .appName("SO")
            .getOrCreate();

Dataset<Row> df1 = spark.read().json("src/main/resources/json/Dataset.json");
df1.printSchema();
df1.show();

Dataset<Row> df1Map = df1.select(functions.array("beginTime", "endTime"));
df1Map.show();

答案 1 :(得分:0)

你可以试试这个

Dataset<Row> df1Map = df1.withColumn("begin_end_time", concat(col("beginTime"), lit(" - "), (col("endTime")) );