Spark job stuck due to a connection manager issue

Date: 2014-09-10 20:09:28

Tags: apache-spark

My application gets stuck on this message..

14/09/10 18:11:45 INFO ConnectionManager: Accepted connection from [ip-xx-xx-xxx-xx.ec2.internal/10.33.139.85]
14/09/10 18:11:46 INFO SendingConnection: Initiating connection to [ip-1xx-xx-xxx-xx.ec2.internal/10.33.139.85:44309]
14/09/10 18:11:46 INFO SendingConnection: Connected to [ip-xx-xx-xxx-xx.ec2.internal/10.33.139.85:44309], 1 messages pending

I am running this on an EMR cluster.

Spark version: 1.0.1 [Hadoop 2.2]

Please suggest something...

1 answer:

Answer 0: (score: 0)

I am answering my own question...

It was because the underlying data was badly skewed: one node was getting all the data and could not handle it... its memory usage was close to 100%, and after a few minutes the job would fail...
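For reference, here is a minimal sketch (not from the original post) of how the skew can be confirmed: count fact rows per join key and look at the heaviest keys. The S3 path is the one used below; treating column 0 as the join key is an assumption.

// Hypothetical diagnostic: one dominant count confirms the skew
val keyCounts = sc.textFile("S3PathtoReadFactTable")
  .map(row => (row.split("\t")(0), 1L))
  .reduceByKey(_ + _)

keyCounts.top(10)(Ordering.by(_._2)).foreach(println)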

I was able to fix it by introducing an extra random-number-based key into the join condition...

Example:

import scala.util.Random
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit conversion that provides registerAsTable

// Row type for the salt values (0 to 199)
case class skw(skew: Int)

case class dimension(key: Int, id: Int, start_date: Int, end_date: Int, skew: Int)

val sequenceNumberTableForSkewJoin = sc.parallelize(0 to 199)

sequenceNumberTableForSkewJoin.map { row => skw(row) }.registerAsTable("skw")

// Replicate every dimension row once per salt value via the cartesian product
sc.textFile("S3Pathtoreadthedimension").cartesian(sequenceNumberTableForSkewJoin).map { case (row, skew) =>
  val parts = row.split("\t")
  dimension(parts(6).toInt, parts(0).toInt, parts(7).toInt, parts(8).toInt, skew)
}.registerAsTable("dimension")

case class fact(dim_key1: Int,
  dim_key2: String,
  skew_resolver: Int,
  measures: Double)

// Tag each fact row with a random salt in 0..199 so that a single hot key
// is spread across up to 200 tasks
sc.textFile("S3PathtoReadFactTable").map { row =>
  val parts = row.split("\t")
  fact(parts(0).toInt,
    parts(1),
    Random.nextInt(200),
    parts(2).toDouble) // measure column; the index 2 is assumed here
}.registerAsTable("fact")

sqlContext.sql("select * from dimension, fact where dimension.key = fact.dim_key1 and dimension.skew = fact.skew_resolver").saveAsTextFile("outputpath")
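A note on the trade-off: the cartesian product replicates every dimension row 200 times, so this approach only pays off when the dimension table is small relative to the fact table, and the salt range (200 above) caps how many tasks the hot key can be spread across.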

Thanks