Dynamic Spark SQL query

Time: 2018-05-01 08:33:54

Tags: java apache-spark apache-spark-sql spark-dataframe

How can we dynamically pass column names to a SQL query using Spark in Java?

I tried storing the SQL query in a string and then passing that string as a parameter:

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("Process")
    .config("spark.master", "local")
    .getOrCreate();

String file = "src/main/resources/in/test2.csv";

// Read the CSV with a header row; column types are inferred.
Dataset<Row> orderDataset = spark.read().format("csv")
    .option("sep", ";")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(file);

// Collect the CSV column names.
String[] header = orderDataset.columns();
List<String> headerlist = Arrays.asList(header);

// Column-name mapping maintained in SQL Server.
Properties connectionProperties = new Properties();
connectionProperties.put("user", "USER_NAME");
connectionProperties.put("password", "PASSWORD");
Dataset<Row> jdbcDF2 = spark.read()
    .jdbc("jdbc:sqlserver://myserver",
          "(select dbcolumn_name, cfilecolumn_fieldname, t.DbTable_Name"
          + " from wt_delivery.dbo.dbcolumn dbc"
          + " inner join wt_delivery.dbo.dbtable t on dbc.DbTable_ID = t.DbTable_ID"
          + " inner join wt_delivery.dbo.[column] c on dbc.DbColumn_ID = c.DbColumn_ID"
          + " inner join wt_delivery.dbo.[CFilecolumn] cfc on c.Column_ID = cfc.Column_ID"
          + " WHERE cfc.CFile_ID = 1461) as sq",
          connectionProperties);


// Register the mapping as a temp view so it can be queried with Spark SQL.
jdbcDF2.createOrReplaceTempView("tfiledefinition");
Dataset<Row> results = spark.sql("SELECT dbcolumn_name FROM tfiledefinition");

orderDataset.foreach((ForeachFunction<Row>) row -> {
  String[] fieldNames = row.schema().fieldNames();
  for (String fieldName : fieldNames) {
    int positionField = row.fieldIndex(fieldName);
    Object valueField = row.get(positionField);

    String query = "SELECT dbcolumn_name FROM tfiledefinition"
        + " where cfilecolumn_fieldname = '" + fieldName + "'";
    System.out.println("query: " + query);
    // DatabaseProject.java:144 in the stack trace below: spark.sql(...)
    // is called from inside the foreach lambda, i.e. on an executor.
    Dataset<Row> tablename = spark.sql(query);
    tablename.show();
  }
});
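As an aside, this is what I mean by "dynamic": building the statement on the driver from the collected header. A minimal sketch reusing headerlist from above (the "orders" view name is made up for illustration):

// Sketch: build the SELECT list dynamically from the CSV header on the driver.
// The view name "orders" is illustrative and not part of the code above.
orderDataset.createOrReplaceTempView("orders");
String projection = String.join(", ", headerlist);
Dataset<Row> dynamic = spark.sql("SELECT " + projection + " FROM orders");
dynamic.show();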

I also tried the following statement, where fieldName is the dynamic value, but it produces the same error:

// requires: import static org.apache.spark.sql.functions.col;
jdbcDF2.filter(col("cfilecolumn_fieldname").equalTo(fieldName)).select("DbTable_Name").show();

This code does not work; it fails with the following error:

    18/05/01 10:31:05 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
    java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:139)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:137)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
    at com.databaseproject.DatabaseProject.lambda$1(DatabaseProject.java:144)
    at org.apache.spark.sql.Dataset$$anonfun$foreach$2.apply(Dataset.scala:2666)
    at org.apache.spark.sql.Dataset$$anonfun$foreach$2.apply(Dataset.scala:2666)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
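Judging from the stack trace, the NullPointerException is thrown by the spark.sql(...) call inside the foreach lambda (DatabaseProject.java:144): the lambda runs on the executors, where the SparkSession and its sessionState do not exist. A sketch of the same per-field lookup done entirely on the driver, reusing the tfiledefinition view registered above:

// Sketch: iterate over the CSV header on the driver instead of inside
// foreach, so every spark.sql(...) call runs where the session is valid.
for (String fieldName : orderDataset.columns()) {
    String query = "SELECT dbcolumn_name FROM tfiledefinition"
        + " WHERE cfilecolumn_fieldname = '" + fieldName + "'";
    spark.sql(query).show();
}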

Does anyone know how I should do this?

Thanks!

0 Answers:

No answers