Question

I found similar posts on Stack Overflow, but I could not solve my problem with them, which is why I am writing this one.

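Goal

The goal is to perform column projection [projection = filtering columns] when loading a SQL table (I use SQL Server).

According to the Scala Cookbook, this is the way to filter columns [using an array]: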

sqlContext.read.jdbc(url,"person",Array("gender='M'"),prop)

However, I don't want to hard-code Array("col1", "col2", ...) in my Scala code, which is why I use a configuration file with Typesafe Config (see below).

Configuration file

dataset {
    type = sql
    sql{
        url = "jdbc://host:port:user:name:password"
        tablename = "ClientShampooBusinesLimited"
        driver = "driver"
        other = "i have a lot of other single string elements in the config file..."
        columnList = [
        {
            colname = "id"
            colAlias = "identifient"
        }
        {
            colname = "name"
            colAlias = "nom client"
        }
        {
            colname = "age"
            colAlias = "âge client"
        }
        ]
    }
}

Let's focus on 'columnList': the names of the SQL columns correspond to 'colname'. 'colAlias' is a field that I will use later.

data.scala file

lazy val columnList = configFromFile.getList("dataset.sql.columnList")
lazy val dbUrl = configFromFile.getString("dataset.sql.url")
lazy val DbTableName = configFromFile.getString("dataset.sql.tablename")
lazy val DriverName = configFromFile.getString("dataset.sql.driver")

configFromFile is created by me in another custom class, but that doesn't matter here. The type of columnList is "ConfigList"; this type comes from Typesafe Config.

Main file

def loadDataSQL(): DataFrame = {

val url = datasetConfig.dbUrl 
val dbTablename = datasetConfig.DbTableName
val dbDriver = datasetConfig.DriverName
val columns = // I need help to solve this


/* EDIT 2 march 2017
   This code should not be used. Have a look at the accepted answer.
*/
sparkSession.read.format("jdbc").options(
    Map("url" -> url,
    "dbtable" -> dbTablename,
    "predicates" -> columns,
    "driver" -> dbDriver))
    .load()
}

So my whole problem is extracting the 'colname' values in order to put them into a suitable array. Can someone help me write the right-hand operand of 'val columns'?

Thanks

Answer 1

If you are looking for a way to read the list of colname values into a Scala array, I think this does it:

import scala.collection.JavaConverters._

val columnList = configFromFile.getConfigList("dataset.sql.columnList")
val colNames: Array[String] = columnList.asScala.map(_.getString("colname")).toArray

With the provided file, this results in Array(id, name, age).
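To make the extraction easy to try in isolation, here is a minimal, self-contained sketch. It assumes the Typesafe Config library is on the classpath and parses an inline copy of the columnList section instead of going through configFromFile. The object name ColumnListExample and the aliases map are illustrative additions, not part of the question's code.

```scala
import com.typesafe.config.ConfigFactory
import scala.collection.JavaConverters._

object ColumnListExample {
  // Inline copy of the relevant part of the configuration file from the question.
  private val hocon =
    """
      |dataset {
      |  sql {
      |    columnList = [
      |      { colname = "id",   colAlias = "identifient" }
      |      { colname = "name", colAlias = "nom client" }
      |      { colname = "age",  colAlias = "âge client" }
      |    ]
      |  }
      |}
      |""".stripMargin

  private val config = ConfigFactory.parseString(hocon)

  // getConfigList returns a java.util.List of Config objects, one per list entry.
  private val entries = config.getConfigList("dataset.sql.columnList").asScala

  // Column names in the order they appear in the file.
  val colNames: Array[String] = entries.map(_.getString("colname")).toArray

  // Illustrative extra: colname -> colAlias lookup for later renaming.
  val aliases: Map[String, String] =
    entries.map(c => c.getString("colname") -> c.getString("colAlias")).toMap

  def main(args: Array[String]): Unit =
    println(colNames.mkString(", ")) // prints: id, name, age
}
```

Using getConfigList rather than getList means each entry is already a Config, so getString can be called on it directly without unwrapping ConfigValue objects.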

Edit: as for your actual goal, I don't actually know of any option named predicates (nor could I find evidence of one in the source code, using Spark 2.0.2).
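If the projection has to be stated in the JDBC options themselves, one possibility is the documented behaviour of the dbtable option, which accepts anything valid in a FROM clause, including a parenthesised subquery. The sketch below only builds that string. The helper projectedTable and the subquery alias projected are hypothetical names, and the bracket quoting assumes SQL Server and a table name without a schema prefix.

```scala
object ProjectedTable {
  // Hypothetical helper: wraps a column projection in a subquery that can be
  // passed as the "dbtable" option of the JDBC data source.
  def projectedTable(table: String, columns: Seq[String]): String = {
    require(columns.nonEmpty, "at least one column is required")

    // Bracket-quote identifiers (SQL Server syntax), doubling any closing bracket.
    def quote(identifier: String): String =
      "[" + identifier.replace("]", "]]") + "]"

    val projection = columns.map(quote).mkString(", ")
    s"(SELECT $projection FROM ${quote(table)}) AS projected"
  }
}
```

With the question's names, the option would then read "dbtable" -> ProjectedTable.projectedTable(dbTablename, colNames).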

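The JDBC data source performs "projection pushdown" based on the actual columns selected in the query that is used. In other words, only the selected columns are read from the database, so you can pass the colNames array to a select immediately after creating the DataFrame, for example: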

import org.apache.spark.sql.functions._

sparkSession.read
  .format("jdbc")
  .options(Map("url" -> url, "dbtable" -> dbTablename, "driver" -> dbDriver))
  .load()
  .select(colNames.map(col): _*) // selecting only desired columns
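Since the question mentions using colAlias later, the same select can also rename each column in one pass. This is a minimal sketch, assuming Spark SQL is on the classpath. It uses a small in-memory DataFrame with made-up values as a stand-in for the JDBC load, and selectWithAliases is a hypothetical helper name.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasSelectExample {
  // Hypothetical helper: keeps only the configured columns and renames each
  // one to its alias, given (colname, colAlias) pairs.
  def selectWithAliases(df: DataFrame, columns: Seq[(String, String)]): DataFrame =
    df.select(columns.map { case (name, alias) => col(name).as(alias) }: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("alias-select")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the DataFrame returned by the JDBC load; the row is made up
    // and "country" is an extra column that the projection should drop.
    val df = Seq((1, "Alice", 30, "FR")).toDF("id", "name", "age", "country")

    val columns = Seq("id" -> "identifient", "name" -> "nom client", "age" -> "âge client")
    val result = selectWithAliases(df, columns)

    result.printSchema() // only the three aliased columns remain
    spark.stop()
  }
}
```

The (colname, colAlias) pairs could be read from columnList in the same way as colNames, using getString("colAlias") on each entry.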
