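Is there a way to optimize Spark SQL code?

Time: 2016-02-03 10:35:30

Tags: scala hadoop apache-spark apache-spark-sql spark-dataframe

Update

I am using Spark SQL 1.5.2. I am trying to read many Parquet files, then filter and aggregate the rows. There are about 35M rows stored in ~30 files in my HDFS, and processing takes more than 10 minutes.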

val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12.where("event_data.level >= 90").select(
    "pid", 
    "timestamp", 
    "event_data.level" 
    ).withColumn("event_date", to_date(logins_12("timestamp"))).drop("timestamp").toDF("pid",  "level", "event_date").groupBy("pid", "event_date").agg(Map("level"->"max")).toDF("pid", "event_date", "level")
l_12.first()   

My Spark runs on a two-node cluster, each node with 8 cores and 16 GB of RAM. The Scala output suggests that the computation runs in only one thread:

scala> x.first()
[Stage 1:=======>                                               (50 + 1) / 368]

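When I try count() instead of first(), it looks like two threads are doing the computation. That is still fewer than I expected, since there are about 30 files that could be processed in parallel.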

scala> l_12.count()   
[Stage 4:=====>                                                  (34 + 2) / 368]

I start the Spark console in yarn-client mode with 14g for the executor and 4g for the driver:

./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client

My default Spark configuration:

spark.executor.memory              2g
spark.logConf                      true
spark.eventLog.dir                 maprfs:///apps/spark
spark.eventLog.enabled             true
spark.sql.hive.metastore.sharedPrefixes  com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address  http://test-01:18080

The resulting RDD has 200 partitions:

scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200

Is there a way to optimize this code? Thanks

1 answer:

Answer 0: (score: 2)

Both behaviors are more or less expected. Spark is rather lazy: not only does it not execute transformations until you trigger an action, it can also skip tasks whose output is not required. Since first() requires only a single element, it can compute just one partition. This is most likely why you see only one running thread at some point.
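
The point about first() can be checked on a toy RDD. The following is a minimal sketch, not taken from the question: the object name, the 10-partition test data and the local[2] master are all assumptions made for illustration. It counts how many partitions are actually evaluated by each action using an accumulator. It is written against the Spark 1.x API used in the question; on Spark 2.0 and later, sc.longAccumulator would be the replacement for sc.accumulator.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstVsCount {
  // Returns (partitions evaluated after first(), partitions evaluated after first() + count()).
  def partitionsEvaluated(sc: SparkContext): (Int, Int) = {
    // Accumulator incremented once for every partition that is actually computed.
    val evaluated = sc.accumulator(0)

    // Toy data (assumption): 100 integers spread over 10 partitions.
    val rdd = sc.parallelize(1 to 100, 10).mapPartitions { it =>
      evaluated += 1
      it
    }

    rdd.first()                       // needs a single element only
    val afterFirst = evaluated.value

    rdd.count()                       // has to visit every partition
    val afterCount = evaluated.value

    (afterFirst, afterCount)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("first-vs-count").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (afterFirst, afterCount) = partitionsEvaluated(sc)
      println(s"partitions evaluated after first(): $afterFirst")
      println(s"partitions evaluated after first() and count(): $afterCount")
    } finally {
      sc.stop()
    }
  }
}
```

The number reported after first() should be much smaller than the number of partitions, while count() adds one evaluation per partition. Note that in the question the aggregation introduces a shuffle, so the real job has more stages than this toy example.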

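Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely a matter of Spark defaults. As you can read in the Configuration guide, spark.executor.cores is set to 1 by default on YARN. Two workers, each with one core, give two running threads.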
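
As a rough illustration of the configuration point, the sketch below sets the executor count and the cores per executor explicitly instead of relying on the YARN defaults. The application name and all numeric values are assumptions chosen for a two-node cluster with 8 cores and 16 GB per node; they are not recommendations from the answer, and YARN must be configured to grant that many vcores and that much memory per container.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoginAggregationApp {
  def main(args: Array[String]): Unit = {
    // All values here are illustrative assumptions, not tuned settings.
    val conf = new SparkConf()
      .setAppName("login-aggregation")        // hypothetical application name
      .set("spark.executor.instances", "2")   // one executor per node (assumption)
      .set("spark.executor.cores", "4")       // default on YARN is 1
      .set("spark.executor.memory", "6g")     // must fit into the YARN container limit

    // The master is expected to be supplied by spark-submit, e.g. --master yarn-client.
    val sc = new SparkContext(conf)
    try {
      // Print the effective setting to confirm it was picked up.
      println("spark.executor.cores = " + sc.getConf.get("spark.executor.cores"))
    } finally {
      sc.stop()
    }
  }
}
```

In spark-shell the context already exists, so the same properties would instead be passed at launch, for example with --executor-cores and --num-executors or with --conf spark.executor.cores=4.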