Spark job hangs when writing with DF

Date: 2016-09-14 07:08:39

Tags: apache-spark dataframe apache-spark-sql

The application executes sysDF.write.partitionBy and successfully writes out the first parquet file. After that, however, the application hangs with all executors dead, until some timeout occurs. The action code is as follows:

import sqlContext.implicits._

val systemRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[SystemLog]) basicLog.asInstanceOf[SystemLog] else null).filter(_ != null)
val sysDF = systemRDD.toDF()
sysDF.write.partitionBy("appId").parquet(outputPath + "/system/date=" + dateY4M2D2)

val customRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[CustomLog]) basicLog.asInstanceOf[CustomLog] else null).filter(_ != null)
val customDF = customRDD.toDF()
customDF.write.partitionBy("appId").parquet(outputPath + "/custom/date=" + dateY4M2D2)

val illegalRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[IllegalLog]) basicLog.asInstanceOf[IllegalLog] else null).filter(_ != null)
val illegalDF = illegalRDD.toDF()
illegalDF.write.partitionBy("appId").parquet(outputPath + "/illegal/date=" + dateY4M2D2)

1 answer:

Answer 0 (score: 0)

First, the map can be combined with the filter, which should optimize the query slightly:

val rdd = basicLogRDD.cache()

rdd.filter(_.isInstanceOf[SystemLog]).write.partitionBy("appId").parquet(outputPath + "/system/date=" + dateY4M2D2)
rdd.filter(_.isInstanceOf[CustomLog]).write.partitionBy("appId").parquet(outputPath + "/custom/date=" + dateY4M2D2)
rdd.filter(_.isInstanceOf[IllegalLog]).write.partitionBy("appId").parquet(outputPath + "/illegal/date=" + dateY4M2D2)

First, it is a good idea to cache basicLogRDD, since it is used multiple times; the .cache() operator will keep the RDD in memory. Second, it is not necessary to convert the RDD to a DataFrame explicitly, as it is implicitly converted to a DataFrame, which allows it to be stored using Parquet (for that, you need to define import sqlContext.implicits._).
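The map-plus-filter fusion suggested above can be illustrated without a Spark cluster. The sketch below uses plain Scala collections and hypothetical log classes (SystemLog, CustomLog, and IllegalLog with a msg field are stand-ins, since their real definitions are not shown in the question); it shows that a single filter(_.isInstanceOf[...]) selects the same elements as the question's map-to-null-then-filter chain:

```scala
// Hypothetical log hierarchy mirroring the classes in the question
sealed trait BasicLog
case class SystemLog(msg: String) extends BasicLog
case class CustomLog(msg: String) extends BasicLog
case class IllegalLog(msg: String) extends BasicLog

// Original style from the question: map each element to null unless it is
// a SystemLog, then filter the nulls out afterwards
def splitViaMap(logs: List[BasicLog]): List[SystemLog] =
  logs
    .map(l => if (l.isInstanceOf[SystemLog]) l.asInstanceOf[SystemLog] else null)
    .filter(_ != null)

// Combined style from the answer: one filter does the same selection
def splitViaFilter(logs: List[BasicLog]): List[BasicLog] =
  logs.filter(_.isInstanceOf[SystemLog])

val sample: List[BasicLog] =
  List(SystemLog("boot"), CustomLog("click"), IllegalLog("bad"), SystemLog("halt"))

println(splitViaMap(sample).map(_.msg))    // List(boot, halt)
println(splitViaFilter(sample).size)       // 2
```

Note that on a cached RDD each of the three filters re-scans the in-memory data rather than recomputing basicLogRDD from its source, which is why caching matters when the same RDD feeds several writes.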