Unexpected behaviour inside RDD's foreachPartition method

Asked: 2016-04-27 08:27:58

Tags: scala apache-spark rdd

I evaluated the following Scala code via spark-shell:


val a = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
val b = a.coalesce(1)
b.foreachPartition { p => 
  p.map(_ + 1).foreach(println)
  p.map(_ * 2).foreach(println)
}

Why is the partition p empty after the first map?

2 Answers

Answer 0 (score: 6)

This is not surprising to me: p is an Iterator, and once you have traversed it with map, it has no more values. Given that length is a shortcut for size, which is implemented like this:

def size: Int = {
  var result = 0
  for (x <- self) result += 1
  result
}

you get 0.
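The same single-pass behaviour can be reproduced in plain Scala, without Spark. A minimal sketch (reusing an iterator after calling map on it is technically unspecified, but it illustrates the exhaustion described above):

```scala
// An Iterator is single-pass: consuming it once leaves nothing behind.
val it = Iterator(1, 2, 3)

val first = it.map(_ + 1).toList   // consumes the underlying iterator
val second = it.map(_ * 2).toList  // iterator already exhausted

println(first)   // List(2, 3, 4)
println(second)  // List() -- nothing left to map over
println(it.size) // 0, per the size implementation shown above
```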

Answer 1 (score: 1)

The answer is in the Scala docs: http://www.scala-lang.org/api/2.11.8/#scala.collection.Iterator. They explicitly state that an iterator should never be used again after calling a method such as map on it (and p is an Iterator).
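One way to avoid reusing the iterator is to materialize it exactly once and then run both passes over the resulting collection. A hedged sketch in plain Scala (the helper twoPasses is illustrative, not from the original post; inside foreachPartition the same idea applies to the partition iterator p, provided the partition fits in memory):

```scala
// Materialize the iterator once; a List supports multiple passes.
def twoPasses(p: Iterator[Int]): (List[Int], List[Int]) = {
  val items = p.toList                  // consume the iterator exactly once
  (items.map(_ + 1), items.map(_ * 2))  // safe: both passes read the List
}

val (plusOne, timesTwo) = twoPasses(Iterator(1, 2, 3))
println(plusOne)  // List(2, 3, 4)
println(timesTwo) // List(2, 4, 6)

// Alternative when materializing is too costly:
// val (p1, p2) = p.duplicate  // two independent iterators (buffers internally)
```

Note that Iterator.duplicate also buffers elements internally as the two copies diverge, so neither option is free for very large partitions.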