How can we pass column names dynamically into a SQL query with Spark in Java?
I tried storing the SQL query in a string and passing that string as a parameter:
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("Process")
        .config("spark.master", "local")
        .getOrCreate();

// Read the semicolon-separated CSV, inferring the schema from the header row.
String file = "src/main/resources/in/test2.csv";
Path filePath = new Path("src/main/resources/in/test2.csv");
Dataset<Row> orderDataset = spark.read().format("csv")
        .option("sep", ";")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(file);
String[] header = orderDataset.columns();
List<String> headerList = Arrays.asList(header);

// Load the field-name -> column-name mapping from SQL Server via JDBC.
Properties connectionProperties = new Properties();
connectionProperties.put("user", "USER_NAME");
connectionProperties.put("password", "PASSWORD");
Dataset<Row> jdbcDF2 = spark.read().jdbc(
        "jdbc:sqlserver://myserver",
        "(select dbcolumn_name, cfilecolumn_fieldname, t.DbTable_Name"
            + " from wt_delivery.dbo.dbcolumn dbc"
            + " inner join wt_delivery.dbo.dbtable t on dbc.DbTable_ID = t.DbTable_ID"
            + " inner join wt_delivery.dbo.[column] c on dbc.DbColumn_ID = c.DbColumn_ID"
            + " inner join wt_delivery.dbo.[CFilecolumn] cfc on c.Column_ID = cfc.Column_ID"
            + " WHERE cfc.CFile_ID = 1461) as sq",
        connectionProperties);
jdbcDF2.createOrReplaceTempView("tfiledefinition");
Dataset<Row> results = spark.sql("SELECT dbcolumn_name FROM tfiledefinition");

// For every row, look up each CSV field name in the mapping.
// Note: this lambda is serialized and executed on the executors.
orderDataset.foreach((ForeachFunction<Row>) row -> {
    String[] fieldNames = row.schema().fieldNames();
    for (String fieldName : fieldNames) {
        int positionField = row.fieldIndex(fieldName);
        Object valueField = row.get(positionField);
        String query = "SELECT dbcolumn_name FROM tfiledefinition"
            + " where cfilecolumn_fieldname = '" + fieldName + "'";
        System.out.println("query: " + query);
        Dataset<Row> tablename = spark.sql(query); // this is where it fails
        tablename.show();
    }
});
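To make the goal concrete: in the simplest case I just want to splice the CSV column names into the SQL text on the driver, roughly like the sketch below (untested; the orders view name and the selectList/projected variables are only illustrative):

// Sketch: build the SELECT list dynamically from the CSV header, on the driver.
orderDataset.createOrReplaceTempView("orders");
String selectList = String.join(", ", orderDataset.columns());
Dataset<Row> projected = spark.sql("SELECT " + selectList + " FROM orders");
projected.show();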
I also tried the following statement, where fieldName is the dynamic value, but it produces the same error:
jdbcDF2.filter(col("cfilecolumn_fieldname").equalTo(fieldName)).select("DbTable_Name").show();
This code does not work; it generates the following error:
18/05/01 10:31:05 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:139)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:137)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
at com.databaseproject.DatabaseProject.lambda$1(DatabaseProject.java:144)
at org.apache.spark.sql.Dataset$$anonfun$foreach$2.apply(Dataset.scala:2666)
at org.apache.spark.sql.Dataset$$anonfun$foreach$2.apply(Dataset.scala:2666)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
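Judging from the stack trace, the NullPointerException is thrown when spark.sql() runs inside the foreach lambda, i.e. on an executor, where the driver's SparkSession is not available. A workaround I am considering, but have not verified, is to collect the mapping to the driver first (the nameByField map below is just my sketch of that idea):

// Sketch: pull the fieldname -> column-name mapping to the driver once,
// then resolve each CSV header locally instead of calling spark.sql() per row.
// (uses java.util.Map / java.util.HashMap)
Map<String, String> nameByField = new HashMap<>();
for (Row r : jdbcDF2.select("cfilecolumn_fieldname", "dbcolumn_name").collectAsList()) {
    nameByField.put(r.getString(0), r.getString(1));
}
for (String fieldName : orderDataset.columns()) {
    System.out.println(fieldName + " -> " + nameByField.get(fieldName));
}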
Does anyone know how I can do this?
Thanks!