Creating a random list of names in spark / scala

Asked: 2016-06-16 09:53:16

Tags: scala apache-spark yarn cloudera

I am generating roughly 10 mln persons in a CSV file on HDFS using spark / scala, by randomly mixing two CSV files containing first names and last names, plus a random birth date between 1920 and now, a creation date, and a counter.

I ran into some problems using a for loop. Everything worked, but in that case the loop part runs only on the driver, which seems to be the limit: 1 mln works fine, but generating 10 mln takes about 10 minutes. So I decided to create a range with 100 million items so that I could use a map and make use of the cluster.

package ebicus
import org.apache.spark._
import org.joda.time.{DateTime,Interval,LocalDateTime}
import org.joda.time.format.DateTimeFormat
import java.util.Random

   object main_generator_spark {
   val conf = new SparkConf()
        .setAppName("Generator")
   val sc = new SparkContext(conf)         
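   // NOTE: conf and sc are fields of the object itself (added here per Edit2 below),
   // so they are created as part of the object's initializer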

   val rs = new Random()
   val file = sc.textFile("hdfs://host:8020/user/firstname")
   val fnames = file.flatMap(line => line.split("\n")).map(x => x.split(",")(1))

   val fnames_ar = fnames.collect()
   val fnames_size = fnames_ar.length
   val firstnames = sc.broadcast(fnames_ar)

   val file2 = sc.textFile("hdfs://host:8020/user/lastname")
   val lnames = file2.flatMap(line => line.split("\n")).map(x => x.split(",")(1))
   val lnames_ar =  lnames.collect()
   val lnames_size = lnames_ar.length
   val lastnames = sc.broadcast(lnames_ar)
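   // fnames_ar / lnames_ar are plain driver-side arrays; firstnames / lastnames
   // are the broadcast handles (executors read broadcasts via .value)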

   val range_val = sc.range(0, 100000000, 1, 20)

   val rddpersons = range_val.map(x =>
                            (x.toString, 
                            new DateTime().toString("y-M-d::H:m:s"),
                            fnames_ar(rs.nextInt(fnames_size)),   // <-- line 77 in the stack trace below
                            lnames_ar(rs.nextInt(lnames_size)),
                            makeGebDate
                            )
      )

   // random birth date between 1920-01-01 and a date at least 18 years in the past
   def makeGebDate(): String = {
     lazy val start = new DateTime(1920, 1, 1, 0, 0, 0)
     lazy val end = new DateTime().minusYears(18)
     lazy val hours = (new Interval(start, end).toDurationMillis / (1000 * 60 * 60)).toInt
     start.plusHours(rs.nextInt(hours)).toString("y-MM-dd")
   }

  def main(args: Array[String]): Unit = {      
     rddpersons.saveAsTextFile("hdfs://host:8020/user/output")
  }
}

The code works fine when I use spark-shell, but when I try to run the script with spark-submit (I build it with Maven):

spark-submit --class ebicus.main_generator_spark --num-executors 16 --executor-cores 4 --executor-memory 2G --driver-cores 2 --driver-memory 10g /u01/stage/mvn_test-0.0.2.jar

I get the following error:

16/06/16 11:17:29 INFO DAGScheduler: Final stage: ResultStage 2(saveAsTextFile at main_generator_sprak.scala:93)
16/06/16 11:17:29 INFO DAGScheduler: Parents of final stage: List()
16/06/16 11:17:29 INFO DAGScheduler: Missing parents: List()
16/06/16 11:17:29 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[11] at saveAsTextFile at main_generator_sprak.scala:93), which has no missing parents
16/06/16 11:17:29 INFO MemoryStore: ensureFreeSpace(140536) called with curMem=1326969, maxMem=5556991426
16/06/16 11:17:29 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 137.2 KB, free 5.2 GB)
16/06/16 11:17:29 INFO MemoryStore: ensureFreeSpace(48992) called with curMem=1467505, maxMem=5556991426
16/06/16 11:17:29 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 47.8 KB, free 5.2 GB)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.29.7.4:51642 (size: 47.8 KB, free: 5.2 GB)
16/06/16 11:17:29 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
16/06/16 11:17:29 INFO DAGScheduler: Submitting 20 missing tasks from ResultStage 2 (MapPartitionsRDD[11] at saveAsTextFile at main_generator_sprak.scala:93)
16/06/16 11:17:29 INFO YarnScheduler: Adding task set 2.0 with 20 tasks
16/06/16 11:17:29 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 8, cloudera-001.fusion.ebicus.com, partition 0,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 9, cloudera-003.fusion.ebicus.com, partition 1,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 10, cloudera-001.fusion.ebicus.com, partition 2,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 11, cloudera-003.fusion.ebicus.com, partition 3,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 4.0 in stage 2.0 (TID 12, cloudera-001.fusion.ebicus.com, partition 4,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 5.0 in stage 2.0 (TID 13, cloudera-003.fusion.ebicus.com, partition 5,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 6.0 in stage 2.0 (TID 14, cloudera-001.fusion.ebicus.com, partition 6,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO TaskSetManager: Starting task 7.0 in stage 2.0 (TID 15, cloudera-003.fusion.ebicus.com, partition 7,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on cloudera-003.fusion.ebicus.com:52334 (size: 47.8 KB, free: 1060.2 MB)
16/06/16 11:17:29 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on cloudera-001.fusion.ebicus.com:53110 (size: 47.8 KB, free: 1060.2 MB)
16/06/16 11:17:30 INFO TaskSetManager: Starting task 8.0 in stage 2.0 (TID 16, cloudera-001.fusion.ebicus.com, partition 8,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:30 INFO TaskSetManager: Starting task 9.0 in stage 2.0 (TID 17, cloudera-001.fusion.ebicus.com, partition 9,PROCESS_LOCAL, 2029 bytes)
16/06/16 11:17:30 WARN TaskSetManager: Lost task 6.0 in stage 2.0 (TID 14, cloudera-001.fusion.ebicus.com): java.lang.NoClassDefFoundError: Could not initialize class ebicus.main_generator_spark$
        at ebicus.main_generator_spark$$anonfun$5.apply(main_generator_sprak.scala:77)
        at ebicus.main_generator_spark$$anonfun$5.apply(main_generator_sprak.scala:74)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1205)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Am I making some kind of fundamental thinking error here? I would be glad if someone could point me in the right direction.
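One thing I am wondering about: my map closure reads the plain driver-side arrays fnames_ar / lnames_ar and the shared rs rather than the broadcast handles. Would building the RDD inside main, with local references to the broadcasts, behave differently? A sketch of what I mean (makeGebDate left out for brevity; this is a guess on my part, not a verified fix):

   def main(args: Array[String]): Unit = {
     val fn = firstnames   // local vals: the closure captures these handles,
     val ln = lastnames    // not the enclosing object
     val rddpersons = range_val.map { x =>
       val r = new Random()   // per-record RNG instead of a field shared via the object
       (x.toString,
        new DateTime().toString("y-M-d::H:m:s"),
        fn.value(r.nextInt(fn.value.length)),
        ln.value(r.nextInt(ln.value.length)))
     }
     rddpersons.saveAsTextFile("hdfs://host:8020/user/output")
   }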

Edit: I am using cloudera 5.6.0, spark 1.5.0, scala 2.10.6, yarn 2.10, joda-time 2.9.4

Edit2: added the conf & sc

0 Answers
