Spark: order of persist and repartition

Time: 2015-11-12 23:03:52

Tags: apache-spark rdd partition persist

I have the following code:

val data = input.map{... }.persist(StorageLevel.MEMORY_ONLY_SER).repartition(2000)

I would like to know what difference it makes if I repartition first:

val data = input.map{... }.repartition(2000).persist(StorageLevel.MEMORY_ONLY_SER)

Does the order in which repartition and persist are called make a difference? Thanks!

1 answer:

Answer 0: (score: 12)

Yes, there is a difference.

In the first case, the RDD is persisted right after the map phase. This means that every time data is accessed, the repartition is triggered again.

In the second case, you cache after repartitioning. When data is accessed and has already been materialized, no additional work is needed.
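A quick way to see where the persist flag ends up is to compare getStorageLevel on the final RDD of each ordering. The sketch below is only an illustration: the object name PersistOrderSketch, the helper finalStorageLevels, the local[2] master, the 20-partition target, and the parallelize call standing in for input are all assumptions, not part of the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistOrderSketch {
  /** Returns the storage level of the final RDD for each ordering (hypothetical helper). */
  def finalStorageLevels(sc: SparkContext): (StorageLevel, StorageLevel) = {
    // Stand-in for the question's `input`; the real source is not shown there.
    val input = sc.parallelize(1 to 10, 8)

    // Ordering 1: persist marks the mapped RDD; repartition then returns a new RDD.
    val data1 = input.map(identity)
      .persist(StorageLevel.MEMORY_ONLY_SER)
      .repartition(20)

    // Ordering 2: persist marks the repartitioned RDD itself.
    val data2 = input.map(identity)
      .repartition(20)
      .persist(StorageLevel.MEMORY_ONLY_SER)

    (data1.getStorageLevel, data2.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to keep the sketch self-contained.
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-order-sketch").setMaster("local[2]"))
    try {
      val (level1, level2) = finalStorageLevels(sc)
      println(s"persist then repartition: ${level1.description}")
      println(s"repartition then persist: ${level2.description}")
    } finally {
      sc.stop()
    }
  }
}
```

In the first ordering the persisted RDD is the map output that sits upstream of the shuffle, so the RDD bound to data1 carries no storage level of its own and the shuffle has to be redone on each access.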

To demonstrate this, we can run an experiment and then look at the storage info:

import org.apache.spark.storage.StorageLevel

val data1 = sc.parallelize(1 to 10, 8)
  .map(identity)
  .persist(StorageLevel.MEMORY_ONLY_SER)
  .repartition(2000)
data1.count()

val data2 = sc.parallelize(1 to 10, 8)
  .map(identity)
  .repartition(2000)
  .persist(StorageLevel.MEMORY_ONLY_SER)
data2.count()

The storage info shows two persisted RDDs, one with 2000 partitions and the other with 8.
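One possible way to check this from code is SparkContext.getPersistentRDDs, which lists the RDDs currently marked as persistent. The sketch below is an assumption-laden illustration rather than the exact inspection used above: the object name PersistedRddSketch, the helper persistedPartitionCounts, and the local[2] master are hypothetical. sc.getRDDStorageInfo can additionally report how many partitions are actually cached, but it is a developer API and what it reports may depend on the Spark version.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistedRddSketch {
  /** Runs both orderings and returns the sorted partition counts of all
    * RDDs marked persistent in this context (hypothetical helper). */
  def persistedPartitionCounts(sc: SparkContext, target: Int = 2000): Seq[Int] = {
    // Ordering 1: the persisted RDD is the 8-partition map output.
    val data1 = sc.parallelize(1 to 10, 8)
      .map(identity)
      .persist(StorageLevel.MEMORY_ONLY_SER)
      .repartition(target)
    data1.count()

    // Ordering 2: the persisted RDD is the repartitioned one.
    val data2 = sc.parallelize(1 to 10, 8)
      .map(identity)
      .repartition(target)
      .persist(StorageLevel.MEMORY_ONLY_SER)
    data2.count()

    // getPersistentRDDs maps RDD id -> RDD for everything marked persistent.
    sc.getPersistentRDDs.values.map(_.partitions.length).toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to keep the sketch self-contained.
    val sc = new SparkContext(
      new SparkConf().setAppName("persisted-rdd-sketch").setMaster("local[2]"))
    try {
      persistedPartitionCounts(sc).foreach { n =>
        println(s"persisted RDD with $n partitions")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The helper assumes a fresh SparkContext; any RDDs persisted earlier in the same context would also show up in the result.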