Spark job hangs when writing with DF

Date: 2016-09-14 07:08:39

Tags: apache-spark dataframe apache-spark-sql

The application executes sysDF.write.partitionBy and successfully writes out the first parquet file. After that, however, the application hangs with all executors dead, until some timeout occurs. The action code is as follows:

import sqlContext.implicits._

val systemRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[SystemLog]) basicLog.asInstanceOf[SystemLog] else null).filter(_ != null)
val sysDF = systemRDD.toDF()
sysDF.write.partitionBy("appId").parquet(outputPath + "/system/date=" + dateY4M2D2)

val customRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[CustomLog]) basicLog.asInstanceOf[CustomLog] else null).filter(_ != null)
val customDF = customRDD.toDF()
customDF.write.partitionBy("appId").parquet(outputPath + "/custom/date=" + dateY4M2D2)

val illegalRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[IllegalLog]) basicLog.asInstanceOf[IllegalLog] else null).filter(_ != null)
val illegalDF = illegalRDD.toDF()
illegalDF.write.partitionBy("appId").parquet(outputPath + "/illegal/date=" + dateY4M2D2)

1 answer:

Answer 0 (score: 0)

First, the map can be combined with the filter, which should optimize the query slightly:

val rdd = basicLogRDD.cache()

rdd.filter(_.isInstanceOf[SystemLog]).write.partitionBy("appId").parquet(outputPath + "/system/date=" + dateY4M2D2)
rdd.filter(_.isInstanceOf[CustomLog]).write.partitionBy("appId").parquet(outputPath + "/custom/date=" + dateY4M2D2)
rdd.filter(_.isInstanceOf[IllegalLog]).write.partitionBy("appId").parquet(outputPath + "/illegal/date=" + dateY4M2D2)

First, it is a good idea to cache basicLogRDD, since it is used multiple times; the .cache() operator will keep the RDD in memory. Second, it is not necessary to convert the RDD to a DataFrame explicitly, as it is implicitly converted to a DataFrame, which allows it to be stored using Parquet (for that, you need to define import sqlContext.implicits._).
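The map-plus-filter fusion suggested above can be illustrated without a Spark cluster. The sketch below uses plain Scala collections and hypothetical log classes (SystemLog, CustomLog, and IllegalLog with a msg field are stand-ins, since their real definitions are not shown in the question); it shows that a single filter(_.isInstanceOf[...]) selects the same elements as the question's map-to-null-then-filter chain:

```scala
// Hypothetical log hierarchy mirroring the classes in the question
sealed trait BasicLog
case class SystemLog(msg: String) extends BasicLog
case class CustomLog(msg: String) extends BasicLog
case class IllegalLog(msg: String) extends BasicLog

// Original style from the question: map each element to null unless it is
// a SystemLog, then filter the nulls out afterwards
def splitViaMap(logs: List[BasicLog]): List[SystemLog] =
  logs
    .map(l => if (l.isInstanceOf[SystemLog]) l.asInstanceOf[SystemLog] else null)
    .filter(_ != null)

// Combined style from the answer: one filter does the same selection
def splitViaFilter(logs: List[BasicLog]): List[BasicLog] =
  logs.filter(_.isInstanceOf[SystemLog])

val sample: List[BasicLog] =
  List(SystemLog("boot"), CustomLog("click"), IllegalLog("bad"), SystemLog("halt"))

println(splitViaMap(sample).map(_.msg))    // List(boot, halt)
println(splitViaFilter(sample).size)       // 2
```

Note that on a cached RDD each of the three filters re-scans the in-memory data rather than recomputing basicLogRDD from its source, which is why caching matters when the same RDD feeds several writes.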