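I am running Spark in cluster mode on top of YARN. The goal is to launch spark-bench (a Spark benchmark suite) to put stress on the cluster's I/O.
This is the file I use to generate the data and run the queries with Spark (csv-vs-parquet.conf):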
spark-bench = {
  spark-submit-config = [{
    spark-home = "/usr/local/spark" // PATH TO YOUR SPARK INSTALLATION
    spark-args = {
      master = "yarn" // FILL IN YOUR MASTER HERE
      //executor-memory = "14G" // FILL IN YOUR EXECUTOR MEMORY
      num-executors = 15
      executor-cores = 15
    }
    conf = {
      // Any configuration you need for your setup goes here, like:
      "spark.dynamicAllocation.enabled" = "false"
      "spark.dynamicAllocation.monitor.enabled" = "false"
      "spark.shuffle.service.enabled" = "false"
      "spark.sql.parquet.mergeSchema" = "true"
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        // benchmark-output = "file:///home/hadoop_fuse/result-dat-gen.csv"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 500000000
            cols = 240
            output = "file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv"
          },{
            name = "sql"
            query = "select * from input"
            input = "file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv"
            output = "file:///tmp/hadoop_fuse/gen_data/kmeans-data.parquet"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "file:///tmp/gen_data/csv-vs-parquet/results-sql.csv"
        parallel = false
        repeat = 1
        workloads = [
          {
            name = "sql"
            input = ["file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv", "file:///tmp/hadoop_fuse/gen_data/kmeans-data.parquet"]
            query = ["select * from input", "select c0, c22 from input where c0 < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}
I then run the benchmark with the command ./bin/spark-bench.sh examples/csv-vs-parquet.conf
The tricky part is this section of the configuration:
{
  name = "data-generation-kmeans"
  rows = 500000000
  cols = 240
  output = "file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv"
}
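The data generation does not run fully in parallel: only 2 of the 8 nodes take part in it, so generating the data takes a very long time.
Meanwhile, the rest of the benchmark (converting the CSV file to Parquet and running the queries) does run in parallel.
What is the reason for this?
Note that the directory /tmp/hadoop_fuse is shared between the nodes, so everything created in this directory is visible on all nodes!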
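
One way to reason about the symptom: Spark runs one task per partition, so a stage with only a few partitions can keep at most that many task slots busy, no matter how many executors were requested. The sketch below is plain Python arithmetic, not spark-bench code. It uses the rows, cols, num-executors and executor-cores values from the configuration above. The partition count of 2 is only a guess inferred from the 2 busy nodes and has not been confirmed against spark-bench. The function names are hypothetical.

```python
def busy_slots(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Upper bound on tasks that can run at the same time in one stage.

    Spark schedules one task per partition, so the bound is the smaller of
    the partition count and the total number of task slots.
    """
    total_slots = num_executors * cores_per_executor
    return min(num_partitions, total_slots)


def values_per_task(rows: int, cols: int, num_partitions: int) -> int:
    """Cells each task has to generate if rows are split evenly (rounded down)."""
    return (rows * cols) // num_partitions


if __name__ == "__main__":
    rows, cols = 500_000_000, 240   # taken from the data-generation-kmeans workload
    executors, cores = 15, 15       # taken from spark-args

    # 2 is an ASSUMED partition count (a guess matching the 2 busy nodes);
    # 225 is simply executors * cores, shown for comparison.
    for partitions in (2, executors * cores):
        print(
            f"partitions={partitions}: "
            f"busy slots <= {busy_slots(partitions, executors, cores)}, "
            f"values per task = {values_per_task(rows, cols, partitions)}"
        )
```

If the guess holds, each of the two tasks would have to produce tens of billions of values by itself, which would be consistent with the long generation time, while the later SQL stages read an input that is split into many more partitions.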