I'm new to Scala. I need to connect to a database and select a column called "queue_message" from a table called "queue". This column contains JSON with this shape:
{"LOG_ID":"2442204","CUSTOMER_CODE":"79D3QL","CFILE_WEIGHT":"1","PROVIDER_ID":"","FILETYPE_DIRECTORYFROM":"\\FromCustomer","FILE_CHARSET":"","CFILE_FORMAT":"CSV","FILE_NAME":"1475_18032018T164840_1.csv","FILETYPE_LABEL":"Order","FILE_ID":1475,"FILEFORMAT_CODE":"","CUSTOMER_ID":1016,"FILE_MASK":"wt_cde_*-*_*.csv"}
I need to deserialize this column in Scala (or in Java as a second option), and then serialize another structure to JSON format.
Here is my code in Scala:
package com.orienit.spark.training.sparkexamples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("my first scala App")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val url = "jdbc:sqlserver://localhost:1433;user=xsxx;password=xxx;databaseName=xxx"
    val df = sqlContext
      .read
      .format("jdbc")
      .option("url", url)
      .option("dbtable", "(select top 1 queue_message from mq..queue where queuename_id = 4 order by queue_id desc) as sq")
      .load()
    df.show()
    println(df.collectAsList())
  }
}
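For plain (de)serialization outside of Spark, one hedged option is json4s, which Spark itself already depends on. The sketch below is not from the original post: the `QueueMessage` case class, its field subset, and the reply structure are illustrative assumptions based on the sample message above.

```scala
// A sketch of one approach (an assumption, not the poster's code): pull the JSON
// string out of the column and parse it with json4s, which ships with Spark.
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write

// Only a few of the sample message's fields, for brevity; extract ignores the rest
case class QueueMessage(LOG_ID: String, CUSTOMER_CODE: String, FILE_NAME: String, FILE_ID: Long)

object Json4sSketch {
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    val json = """{"LOG_ID":"2442204","CUSTOMER_CODE":"79D3QL","FILE_NAME":"1475_18032018T164840_1.csv","FILE_ID":1475}"""

    // Deserialize the JSON string into the case class
    val msg = parse(json).extract[QueueMessage]
    println(msg.FILE_NAME)

    // Serialize another (hypothetical) structure back to a JSON string
    val reply = write(Map("status" -> "processed", "fileId" -> msg.FILE_ID.toString))
    println(reply)
  }
}
```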
These are the dependencies in my Scala project's Maven pom.xml:
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
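One thing worth noting: a `jdbc:sqlserver` URL needs Microsoft's JDBC driver on the classpath, and the pom above does not declare it. A sketch of the missing dependency (the coordinates are for Microsoft's driver; the version shown is an illustrative assumption, pick one matching your JRE):

```xml
<!-- Assumed: SQL Server JDBC driver; version is illustrative -->
<dependency>
    <groupId>com.microsoft.sqlserver</groupId>
    <artifactId>mssql-jdbc</artifactId>
    <version>6.4.0.jre8</version>
</dependency>
```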
Here is my code in Java:
package com.orienit.spark.training.javaJdbcConnectivity;

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("My app");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        Map<String, String> options = new HashMap<String, String>();
        options.put("url", "jdbc:sqlserver://localhost:1433;user=xsxx;password=xxx;databaseName=xxx");
        options.put("dbtable", "(select top 1 queue_message from mq..queue where queuename_id = 4 order by queue_id desc) as sq");
        Dataset<Row> df = sqlContext.read().format("jdbc").options(options).load();
        df.show();
        System.out.println(df.collectAsList());
        // toJSON() returns a Dataset<String>; show its rows instead of printing the Dataset object
        df.toJSON().show(false);
    }
}
These are my Java project's dependencies:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
Can anyone help me serialize to and deserialize from JSON, or point me to any relevant documentation on this topic? I haven't found anything useful for this kind of operation in the official Spark documentation.
Many thanks.
Answer 0 (score: 0)
You can use the from_json function in Spark.
Assuming the schema of the JSON is held in a value called schema, you can simply do:
import org.apache.spark.sql.functions.from_json
import sqlContext.implicits._ // needed for the $"..." column syntax

df.withColumn("deserialized", from_json($"queue_message", schema))
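To make that concrete, here is a minimal, self-contained sketch of the suggestion, plus to_json for the serialization half of the question. Assumptions not in the original post: a local SparkSession with an in-memory DataFrame stands in for the JDBC source, only a subset of the sample message's fields is declared in the schema, and the "reply" structure is invented for illustration.

```scala
// Sketch: deserialize a JSON string column with from_json, then serialize
// another structure back to JSON with to_json.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, to_json, struct}
import org.apache.spark.sql.types.{StructType, StringType, LongType}

object FromJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("from_json sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the queue_message column that would be read over JDBC
    val df = Seq(
      """{"LOG_ID":"2442204","CUSTOMER_CODE":"79D3QL","FILE_NAME":"1475_18032018T164840_1.csv","FILE_ID":1475}"""
    ).toDF("queue_message")

    // Schema describing the JSON held in the column (subset of the real fields)
    val schema = new StructType()
      .add("LOG_ID", StringType)
      .add("CUSTOMER_CODE", StringType)
      .add("FILE_NAME", StringType)
      .add("FILE_ID", LongType)

    // Deserialize: the string column becomes a struct whose fields are queryable
    val parsed = df.withColumn("msg", from_json($"queue_message", schema))
    parsed.select($"msg.FILE_NAME", $"msg.FILE_ID").show(false)

    // Serialize another structure back to a JSON string with to_json
    val out = parsed.select(to_json(struct($"msg.LOG_ID", $"msg.FILE_ID")).as("reply"))
    out.show(false)

    spark.stop()
  }
}
```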