Data generation with spark-bench is not done in parallel

Asked: 2019-06-26 10:19:36

Tags: apache-spark hadoop yarn benchmarking

I am running Spark in cluster mode on top of YARN. The goal is to launch spark-bench (a Spark benchmarking suite) to put stress on the cluster's I/O. This is the file I use to generate the data and run queries with Spark (csv-vs-parquet.conf):

 spark-bench = {
  spark-submit-config = [{
    spark-home = "/usr/local/spark" // PATH TO YOUR SPARK INSTALLATION
    spark-args = {
      master = "yarn" // FILL IN YOUR MASTER HERE
    //executor-memory = "14G" // FILL IN YOUR EXECUTOR MEMORY
      num-executors = 15
      executor-cores = 15
    }
    conf = {
      // Any configuration you need for your setup goes here, like:
       "spark.dynamicAllocation.enabled" = "false"
       "spark.dynamicAllocation.monitor.enabled" = "false"
      "spark.shuffle.service.enabled" = "false"
       "spark.sql.parquet.mergeSchema" = "true"
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
       // benchmark-output = "file:///home/hadoop_fuse/result-dat-gen.csv"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 500000000
            cols = 240
            output = "file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv"
          },
          {
            name = "sql"
            query = "select * from input"
            input = "file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv"
            output = "file:///tmp/hadoop_fuse/gen_data/kmeans-data.parquet"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "file:///tmp/gen_data/csv-vs-parquet/results-sql.csv"
        parallel = false
        repeat = 1
        workloads = [
          {
            name = "sql"
            input = ["file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv", "file:///tmp/hadoop_fuse/gen_data/kmeans-data.parquet"]
            query = ["select * from input", "select c0, c22 from input where c0 < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}

Then I run the benchmark with the command ./bin/spark-bench.sh examples/csv-vs-parquet.conf. The tricky part is this section of the setup:


          {
            name = "data-generation-kmeans"
            rows = 500000000
            cols = 240
            output = "file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv"
          }

The data generation does not run fully in parallel: only 2 of the 8 nodes take part in it, and as a result generating the data takes a very long time.
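From poking around, the data-generation-kmeans workload appears to delegate to Spark MLlib's KMeansDataGenerator.generateKMeansRDD, and that method's numPartitions parameter defaults to 2. Below is a minimal Scala sketch of what I believe the generation step boils down to; only the generateKMeansRDD call is the real MLlib API, the wrapper and the write-out are my reconstruction of what spark-bench might do with the workload's rows/cols/output values:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.util.KMeansDataGenerator

    // Rough sketch of the generation step (wrapper details assumed).
    def generate(sc: SparkContext): Unit = {
      val data = KMeansDataGenerator.generateKMeansRDD(
        sc,
        numPoints = 500000000,  // "rows" from the workload config
        k = 2,                  // number of cluster centers (assumed default)
        d = 240,                // "cols" from the workload config
        r = 1.0,                // scaling factor (assumed default)
        numPartitions = 2)      // MLlib's default: at most 2 parallel tasks

      // One task per partition, so with 2 partitions at most 2 executors
      // (here: 2 of the 8 nodes) can ever be busy during generation.
      data.map(_.mkString(","))
        .saveAsTextFile("file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv")
    }

If that default is what spark-bench ends up using, it would match the 2-of-8-nodes behaviour exactly, but I could not confirm whether the workload config exposes a partition-count option to override it.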

Meanwhile, the rest of the benchmark (converting the CSV file to Parquet and running the queries) is done in parallel.
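That part makes sense to me: the conversion reads the CSV back from disk, so its parallelism comes from the input splits rather than from the generator. A sketch of what I understand the sql workload to do (the spark-bench internals are my assumption, the Spark calls themselves are standard):

    import org.apache.spark.sql.SparkSession

    // The CSV is split into many input partitions at read time,
    // so every node gets tasks regardless of how the file was produced.
    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.csv("file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv")
    df.createOrReplaceTempView("input")
    spark.sql("select * from input")
      .write
      .parquet("file:///tmp/hadoop_fuse/gen_data/kmeans-data.parquet")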

What is the reason for this difference?

Note that the directory /tmp/hadoop_fuse is shared between the nodes, so everything created in this directory appears on all of them!
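One way to check where the parallelism is capped is to look at partition counts in spark-shell (getNumPartitions is standard Spark; the number of partitions bounds the number of concurrent tasks):

    val gen = spark.read.csv("file:///tmp/hadoop_fuse/gen_data/kmeans-data.csv")
    println(gen.rdd.getNumPartitions)

The Spark UI's stage view shows the same thing as the task count per stage.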

0 answers:

No answers