Does Spark connect to the database every time a Transformation / Action is executed?

Asked: 2017-03-29 10:40:11

Tags: scala apache-spark apache-spark-sql

I am running Spark 2.1.0 on Windows 10. I connect to a MySQL database and pull the data into Spark over JDBC. As shown below, every time I execute an action I get the following warning, which makes me wonder whether the data is being retrieved from the database on every operation.

scala> val jdbcDF2 = spark.read.jdbc("jdbc:mysql:dbserver", "schema.tablename", connectionProperties)
Wed Mar 29 15:05:23 IST 2017 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
jdbcDF2: org.apache.spark.sql.DataFrame = [id: bigint, site: bigint ... 15 more fields]

scala> jdbcDF2.count
Wed Mar 29 15:09:09 IST 2017 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
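As an aside, the warning itself suggests the fix: set `useSSL` explicitly in the connection properties. A minimal sketch, with hypothetical credentials standing in for your own:

```scala
import java.util.Properties

// Hypothetical user/password -- substitute your own values.
val connectionProperties = new Properties()
connectionProperties.put("user", "myuser")
connectionProperties.put("password", "mypassword")
// Explicitly opt out of SSL so the MySQL driver stops warning
// (or set useSSL=true and configure a truststore instead).
connectionProperties.put("useSSL", "false")

// Then connect as before:
// val jdbcDF2 = spark.read.jdbc("jdbc:mysql:dbserver", "schema.tablename", connectionProperties)
```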

If that is the case, is there a way to keep the data in a Spark-local object such as a DataFrame, so that it does not have to connect to the database all the time?

I tried to cache the table and that ran successfully, but I am unable to query the table with Spark SQL:

scala> jdbcDF2.cache()
res6: jdbcDF2.type = [id: bigint, site: bigint ... 15 more fields]
scala> val unique = sql("SELECT DISTINCT site FROM jdbcDF2")
org.apache.spark.sql.AnalysisException: Table or view not found: jdbcDF2;

2 Answers:

Answer 0 (score: 1)

After caching, you can run the query directly on the DataFrame:

val unique = jdbcDF2.selectExpr("count(distinct site)")

val unique = jdbcDF2.select("site").distinct.count
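Note that the two variants above return different things: `selectExpr` yields a one-row DataFrame holding the count, while `select(...).distinct.count` returns a plain `Long`. A minimal local sketch with a hypothetical stand-in for `jdbcDF2`:

```scala
import org.apache.spark.sql.SparkSession

// Local SparkSession; in spark-shell, `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("distinct-demo").getOrCreate()
import spark.implicits._

// Hypothetical data standing in for the JDBC-backed jdbcDF2.
val jdbcDF2 = Seq((1L, 10L), (2L, 10L), (3L, 20L)).toDF("id", "site")

// A one-row DataFrame containing the count:
val asDF = jdbcDF2.selectExpr("count(distinct site)")
// A plain Long:
val asLong = jdbcDF2.select("site").distinct.count
```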

Or create a temporary view from the DataFrame and access it through Spark SQL:

jdbcDF2.createOrReplaceTempView("jdbcDF2")
val unique = sql("SELECT DISTINCT site FROM jdbcDF2")

Answer 1 (score: 0)

You are correct: you can cache your DataFrame for later reuse, so that the database is not queried on every Spark action (collect, count, first, ...).
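A minimal local sketch of this behavior, using hypothetical in-memory data in place of a JDBC source (any DataFrame caches the same way):

```scala
import org.apache.spark.sql.SparkSession

// Local SparkSession; in spark-shell, `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
import spark.implicits._

// Hypothetical data; for a JDBC source, recomputation means re-querying the database.
val df = Seq((1L, 10L), (2L, 10L), (3L, 20L)).toDF("id", "site")

// Mark the DataFrame for caching (lazy: nothing is materialized yet).
df.cache()

// The first action materializes the data into Spark's storage...
val total = df.count()
// ...and subsequent actions read from the cache, not the source.
val sites = df.select("site").distinct.count()
```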

But to query the DataFrame with SQL, you first need to register it as a view:

jdbcDF2.createOrReplaceTempView("my_table")

Then:

sql("SELECT DISTINCT site FROM my_table")