Spark time of parallel execution in Spark for API calls

Time: 2018-09-25 07:35:23

Tags: scala apache-spark

I am running the code below in IntelliJ on my laptop with 8 GB of RAM. I am calling 3 APIs in parallel with the map function and the scalaj library, and timing each API call as follows:

val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
// for each API call, execute it on a different executor and collate the data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x))) 
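
For context, HTTPRequestParallel.ds is assumed here to be a simple blocking GET built on scalaj-http; the real implementation is not shown in the question, so this is only a hypothetical sketch of its shape:

import scalaj.http.{Http, HttpResponse}

object HTTPRequestParallel {
  // hypothetical: issue a blocking GET and return the response body
  def ds(url: String): String = {
    val response: HttpResponse[String] = Http(url).asString
    response.body
  }
}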

When spark.time executes, I expected 3 sets of timings, but it gives me 6:

Time taken: 14945 ms
Time taken: 21773 ms
Time taken: 22446 ms
Time taken: 6438 ms
Time taken: 6877 ms
Time taken: 7107 ms

What am I missing here? Are the calls to the APIs actually made in parallel?

1 answer:

Answer 0 (score: 1):

Actually, that piece of code alone will not execute anything at all: the map function is lazy, so it is not evaluated until an action is performed on the RDD. You should also take into account that if you do not persist the transformed RDD, all of its transformations are re-computed for every action. That means that if you are doing something like this:

val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
// for each API call, execute it on a different executor and collate the data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x)))
val c = actual_data.count()
actual_data.collect()

what is defined inside the map will be executed 6 times (twice for each element of the RDD: once for the count and once for the collect). To avoid this re-computation you can cache or persist the RDD, as shown below.
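
A minimal sketch of the cached variant (the only change from the snippet above is the .cache() after the map; persist() would behave the same way here):

val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
// for each API call, execute it on a different executor and collate the data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x))).cache()
val c = actual_data.count() // first action: runs the map and caches the results
actual_data.collect()       // second action: served from the cache, no re-computation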

In this second example you will only see 3 time logs instead of 6.
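
The question also asks whether the API calls are really made in parallel. That depends on how many partitions the RDD has and how many cores the master offers, not on map itself; assuming local mode, a quick sanity check might look like this (the master setting and app name below are assumptions, not from the original post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")            // assumption: local mode using all available cores
  .appName("parallel-api-calls") // hypothetical app name
  .getOrCreate()

// 3 partitions -> up to 3 tasks (and therefore 3 HTTP calls) can run concurrently
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"), numSlices = 3)
println(s"partitions = ${urls.getNumPartitions}")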