The effect of fetchsize and batchsize on Spark

Asked: 2017-08-09 11:39:24

Tags: database performance apache-spark spark-dataframe

I want to control how fast Spark reads from and writes to an RDB, but the parameters named in the title do not seem to have any effect.

Can I conclude that fetchsize and batchsize simply cannot be verified with my test method? Or do they really affect the reading and writing sides, and the measured results are reasonable for data of this size?
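
For reference, here is a minimal sketch of how these two options are normally handed to Spark's JDBC data source (the URLs, table names, and credentials below are placeholders rather than my actual setup):

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object FetchBatchSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("fetch-batch-sketch").getOrCreate()

        // fetchsize is a read-side hint: how many rows the JDBC driver
        // should pull per round trip while Spark scans the table.
        val readProps = new Properties()
        readProps.put("user", "username")
        readProps.put("password", "password")
        readProps.put("fetchsize", "10000")
        val df = spark.read.jdbc("jdbc:mysql://host/db", "source_table", readProps)

        // batchsize is a write-side hint: how many rows Spark groups
        // into one JDBC batch when executing the INSERTs.
        val writeProps = new Properties()
        writeProps.put("user", "username")
        writeProps.put("password", "password")
        writeProps.put("batchsize", "10000")
        df.write.mode(SaveMode.Overwrite).jdbc("jdbc:mysql://host/db", "target_table", writeProps)

        spark.stop()
      }
    }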

Statistics for batchsize, fetchsize, and the dataset:
/*Dataset*/
+--------------+-----------+
| Observations | Dataframe |
+--------------+-----------+
|      109,077 | Initial   |
|      345,732 | Ultimate  |
+--------------+-----------+
/*fetchsize*/
+-----------+-----------+------------------+------------------+
| fetchsize | batchsize | Reading Time(ms) | Writing Time(ms) |
+-----------+-----------+------------------+------------------+
|        10 |        10 |            2,103 |           38,428 |
|       100 |        10 |            2,123 |           38,021 |
|     1,000 |        10 |            2,032 |           38,345 |
|    10,000 |        10 |            2,016 |           37,892 |
|    50,000 |        10 |            2,017 |           37,795 |
|   100,000 |        10 |            2,055 |           38,720 |
+-----------+-----------+------------------+------------------+
/*batchsize*/
+-----------+-----------+------------------+------------------+
| fetchsize | batchsize | Reading Time(ms) | Writing Time(ms) |
+-----------+-----------+------------------+------------------+
|        10 |        10 |            2,072 |           37,977 |
|        10 |       100 |            2,077 |           36,990 |
|        10 |     1,000 |            2,034 |           36,703 |
|        10 |    10,000 |            1,979 |           36,980 |
|        10 |    50,000 |            2,043 |           36,749 |
|        10 |   100,000 |            2,005 |           36,624 |
+-----------+-----------+------------------+------------------+

[Datadog screenshots: observed metrics and MySQL performance measures]
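
One thing worth ruling out (my assumption, not something the measurements above confirm): MySQL Connector/J only honors these hints when certain connection flags are set. A sketch of the URL flags involved:

    // Assumed MySQL Connector/J connection flags; illustrative URLs,
    // not taken from my test configuration.

    // Without useCursorFetch=true, Connector/J materializes the whole
    // result set on the client, so a positive fetchsize hint is ignored.
    val readUrl = "jdbc:mysql://address/db?useCursorFetch=true"

    // Without rewriteBatchedStatements=true, a JDBC batch still goes to
    // the server as one INSERT statement per row, so batchsize changes little.
    val writeUrl = "jdbc:mysql://address/db?rewriteBatchedStatements=true"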

Details that might help

I created two m4.xlarge Linux instances on AWS, one to run Spark and the other to host the RDB data store, and used Datadog to monitor the Spark application's performance, especially while reading from and writing to the RDB. Spark runs in standalone mode, and the test application simply pulls some data from the MySQL RDB, performs some computation, and pushes the result back to MySQL.

Some details follow:

  1. The JDBC properties are kept in the application.conf file, as follows:

    spark {
      Reading {
        url: "jdbc:mysql://address/designated database"
        driver: "com.mysql.cj.jdbc.Driver"
        user: "username"
        password: "password"
        fetchsize: "10000"
      }
      Writing {
        url: "jdbc:mysql://address/designated database"
        driver: "com.mysql.cj.jdbc.Driver"
        dbtable: "designated table"
        user: "username"
        password: "password"
        batchsize: "10000"
        truncate: "true"
      }
    }
    
  2. Logging with Log4j 2 is enabled when the application runs, and the writing time is measured in it; the reading time is measured the same way (see the sketch after this list).

                    .
                    .
                    .
    startTime = System.nanoTime()
    val connection = new Properties()
    configureProperties(connection, conf, "spark.Writing")
    val ultimateObservations = ultimateResult.count()  // an action, so the computation it triggers runs inside the timed window
    ultimateResult.write
        .mode(SaveMode.Overwrite)
        .jdbc(conf.getString("spark.Writing.url"),
              conf.getString("spark.Writing.dbtable"),
              connection)
    finishedTime = System.nanoTime()
    logger.info("Finished writing from Spark to MySQL, taking {} milliseconds; approximately {} rows/s",
          TimeUnit.MILLISECONDS.convert((finishedTime - startTime), TimeUnit.NANOSECONDS),
          ultimateObservations/TimeUnit.SECONDS.convert((finishedTime - startTime), TimeUnit.NANOSECONDS)
        )
                    .
                    .
                    .
    
    /*
     * configureProperties is a custom helper: it copies every JDBC
     * property except url and dbtable into the Properties object.
     */
    def configureProperties(connectionEntity: Properties, conf: Config, designatedString: String): Unit = {
      import scala.collection.JavaConverters._  // Config.entrySet returns a java.util.Set
      val propertiesCarrier = conf.getConfig(designatedString)
      for (entry <- propertiesCarrier.entrySet.asScala) {
        if (entry.getKey().trim() != "url" && entry.getKey().trim() != "dbtable") {
          connectionEntity.put(entry.getKey(), entry.getValue().unwrapped().toString())
          logger.info("Database configuration: ({}, {}).",
            entry.getKey(), entry.getValue().unwrapped().toString: Any)
        }
      }
    }
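
For completeness, the reading time in the tables above is measured the same way. Below is a sketch of that read-side block; the sparkSession value, the variable names, and the hard-coded table name are my placeholders, not the original code:

    import com.typesafe.config.ConfigFactory

    // conf is assumed to be loaded once from application.conf.
    val conf = ConfigFactory.load()

    startTime = System.nanoTime()
    val readConnection = new Properties()
    configureProperties(readConnection, conf, "spark.Reading")
    val initialResult = sparkSession.read
        .jdbc(conf.getString("spark.Reading.url"),
              "designated table",  // spark.Reading has no dbtable key, so a placeholder
              readConnection)
    val initialObservations = initialResult.count()  // forces the actual read
    finishedTime = System.nanoTime()
    logger.info("Finished reading from MySQL to Spark, taking {} milliseconds",
        TimeUnit.MILLISECONDS.convert(finishedTime - startTime, TimeUnit.NANOSECONDS))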
    

0 Answers