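I'm doing the following on my laptop with 8 GB of RAM, running the code in IntelliJ. I'm calling 3 APIs in parallel using the map function and the scalaj library, and timing each API call as follows: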
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
//for each API call,execute them in different executor and collate data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x)))
When spark.time runs, I expected 3 sets of timings, but it gave me 6:
Time taken: 14945 ms
Time taken: 21773 ms
Time taken: 22446 ms
Time taken: 6438 ms
Time taken: 6877 ms
Time taken: 7107 ms
What am I missing here, and are the API calls actually being made in parallel?
Answer 0 (score: 1)
Actually, that piece of code alone won't execute any spark.time at all: the map function is lazy, so it won't run until you perform an action on the RDD. You should also keep in mind that if you don't persist the transformed RDD, all of its transformations are recomputed for every action. That means that if you are doing something like this:
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
//for each API call,execute them in different executor and collate data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x)))
val c = actual_data.count()
actual_data.collect()
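there will be 6 executions of what is defined inside the map (two for each element in the RDD, the first for the count and the second for the collect). To avoid this recomputation, you can cache or persist the RDD before calling count and collect.
With the RDD cached, you will see only 3 log lines instead of 6.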
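The recomputation described above can be checked directly. Below is a minimal sketch, assuming a local Spark session. The names LazyMapDemo and mapCalls and the accumulator are illustrative and not part of the original code, and a plain string transformation stands in for HTTPRequestParallel.ds so the example needs no network access.

```scala
import org.apache.spark.sql.SparkSession

object LazyMapDemo {
  // Runs count() and collect() on a mapped RDD and returns how many
  // times the body of the map function was executed in total.
  def mapCalls(spark: SparkSession, useCache: Boolean): Long = {
    val sc = spark.sparkContext
    // Accumulator used only to count executions of the map body.
    val calls = sc.longAccumulator("mapCalls")
    val urls = sc.parallelize(Seq("url1", "url2", "url3"))

    // Assumption: toUpperCase is a stand-in for HTTPRequestParallel.ds(x).
    // Nothing runs at this point, because map is a lazy transformation.
    val mapped = urls.map { u => calls.add(1L); u.toUpperCase }

    // cache() marks the RDD for in-memory persistence; it is filled
    // by the first action and reused by later ones.
    val data = if (useCache) mapped.cache() else mapped

    data.count()   // first action: evaluates the map
    data.collect() // second action: re-evaluates the map unless cached
    calls.sum
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LazyMapDemo")
      .getOrCreate()
    try {
      println(s"map calls without cache: ${mapCalls(spark, useCache = false)}")
      println(s"map calls with cache:    ${mapCalls(spark, useCache = true)}")
    } finally {
      spark.stop()
    }
  }
}
```

In a local run with no task failures, the uncached variant should report 6 calls and the cached one 3, which matches the 6 versus 3 "Time taken" lines in the question. If the data might not fit in memory, persist with an explicit storage level is the alternative to cache.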