I want to control the speed of reading from and writing to an RDB directly through Spark, but the parameters named in the title don't seem to take effect. Can I conclude that fetchsize and batchsize cannot be verified with my test method? Or do they in fact affect reading and writing, and the measured results are simply reasonable for data of this size?
Test results for batchsize, fetchsize, and the datasets:

/*Dataset*/
+-----------+--------------+
| Dataframe | Observations |
+-----------+--------------+
| Initial   |      109,077 |
| Ultimate  |      345,732 |
+-----------+--------------+
/*fetchsize*/
+-----------+-----------+------------------+------------------+
| fetchsize | batchsize | Reading Time(ms) | Writing Time(ms) |
+-----------+-----------+------------------+------------------+
|        10 |        10 |            2,103 |           38,428 |
|       100 |        10 |            2,123 |           38,021 |
|     1,000 |        10 |            2,032 |           38,345 |
|    10,000 |        10 |            2,016 |           37,892 |
|    50,000 |        10 |            2,017 |           37,795 |
|   100,000 |        10 |            2,055 |           38,720 |
+-----------+-----------+------------------+------------------+
/*batchsize*/
+-----------+-----------+------------------+------------------+
| fetchsize | batchsize | Reading Time(ms) | Writing Time(ms) |
+-----------+-----------+------------------+------------------+
|        10 |        10 |            2,072 |           37,977 |
|        10 |       100 |            2,077 |           36,990 |
|        10 |     1,000 |            2,034 |           36,703 |
|        10 |    10,000 |            1,979 |           36,980 |
|        10 |    50,000 |            2,043 |           36,749 |
|        10 |   100,000 |            2,005 |           36,624 |
+-----------+-----------+------------------+------------------+
I created two m4.xlarge Linux instances on AWS, one to run Spark and the other to host the RDB, and I use Datadog to observe the Spark application's performance, especially while it reads from and writes to the RDB. Spark runs in standalone mode, and the test application simply pulls some data from the MySQL RDB, does some computation, and then pushes the result back to MySQL.
Some details are as follows:
The JDBC properties are placed in the application.conf file, as shown below:
spark {
  Reading {
    url: "jdbc:mysql://address/designated database"
    driver: "com.mysql.cj.jdbc.Driver"
    user: "username"
    password: "password"
    fetchsize: "10000"
  }
  Writing {
    url: "jdbc:mysql://address/designated database"
    driver: "com.mysql.cj.jdbc.Driver"
    dbtable: "designated table"
    user: "username"
    password: "password"
    batchsize: "10000"
    truncate: "true"
  }
}
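
For reference, this is roughly how the Reading block is consumed on the read side. It is a simplified sketch: the SparkSession setup, the app name, and the table name are placeholders of mine, and configureProperties is the helper shown at the end of this post.

import java.util.Properties

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

val conf  = ConfigFactory.load() // picks up application.conf
val spark = SparkSession.builder().appName("rdb-benchmark").getOrCreate() // placeholder setup

// Copy every key except url/dbtable (driver, user, password, fetchsize, ...)
// into the JDBC connection properties.
val readProps = new Properties()
configureProperties(readProps, conf, "spark.Reading")

// fetchsize travels inside readProps; it is meant to hint how many rows
// the MySQL driver fetches per round trip.
val initialResult = spark.read.jdbc(
  conf.getString("spark.Reading.url"),
  "designated table", // placeholder, like in the config above
  readProps)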
Logging is enabled via log4j2 while the application runs, and that is where the writing time is measured:
.
.
.
startTime = System.nanoTime()
val connection = new Properties()
configureProperties(connection, conf, "spark.Writing")
val ultimateObservations = ultimateResult.count()
ultimateResult.write
  .mode(SaveMode.Overwrite)
  .jdbc(conf.getString("spark.Writing.url"),
    conf.getString("spark.Writing.dbtable"),
    connection)
finishedTime = System.nanoTime()
logger.info("Finished writing from Spark to MySQL, taking {} milliseconds; approximately {} rows/s",
  TimeUnit.MILLISECONDS.convert(finishedTime - startTime, TimeUnit.NANOSECONDS),
  ultimateObservations / TimeUnit.SECONDS.convert(finishedTime - startTime, TimeUnit.NANOSECONDS))
.
.
.
/*
 * configureProperties is a customized function
 */
import scala.collection.JavaConverters._

def configureProperties(connectionEntity: Properties, conf: Config, designatedString: String): Unit = {
  val propertiesCarrier = conf.getConfig(designatedString)
  // Copy every entry except url and dbtable into the Properties object,
  // so driver, user, password, fetchsize/batchsize, truncate, etc. all go through.
  for (entry <- propertiesCarrier.entrySet.asScala) {
    if (entry.getKey.trim != "url" && entry.getKey.trim != "dbtable") {
      connectionEntity.put(entry.getKey, entry.getValue.unwrapped().toString)
      logger.info("Database configuration: ({}, {}).",
        entry.getKey, entry.getValue.unwrapped().toString: Any)
    }
  }
}
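
In case it matters, the same parameters can also be set directly on the DataFrame reader/writer instead of going through Properties. A minimal sketch of that variant, with placeholder url/table/credential values:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("rdb-benchmark").getOrCreate() // placeholder setup

// Read: fetchsize passed as a reader option instead of a Properties entry.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://address/designated database")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("dbtable", "designated table")
  .option("user", "username")
  .option("password", "password")
  .option("fetchsize", "10000")
  .load()

// Write: batchsize (and truncate) passed as writer options.
df.write.format("jdbc")
  .option("url", "jdbc:mysql://address/designated database")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("dbtable", "designated table")
  .option("user", "username")
  .option("password", "password")
  .option("batchsize", "10000")
  .option("truncate", "true")
  .mode(SaveMode.Overwrite)
  .save()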