Is this program executed in the order I'm assuming?

Time: 2016-04-06 17:16:27

Tags: scala apache-spark

I have a program that is supposed to

  1. count how often each word occurs in a corpus, and
  2. take a count threshold and use it as the min-count input for Word2Vec training.

The program below is my attempt at this task. However, from the log file I can see that

    model.save(sc, outputFilePath)
    

    seems to be executed right after the counting task. It does not appear to actually wait for the second task to finish. The result is an empty directory without a model.

      import java.io.File

      import org.apache.commons.io.FileUtils
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.feature.Word2Vec

      import scala.util.Sorting

      def main(args: Array[String]) {
    
        val inputFile = new File(args(0))
        val outputDirectory = new File(args(1))
        outputDirectory.mkdirs()
        val wordsToKeepCount = args(2).toInt    
        val conf = new SparkConf().setAppName("Word2VecOnCluster")    
        val sc = new SparkContext(conf)
        val file = sc.textFile(inputFile.getAbsolutePath)
    
        // Task 1: Count the occurrence of words.
        // preprocessing() is a user-defined helper (defined elsewhere,
        // not shown here) that cleans the raw input lines.
    
        val wordCounts = file
          .repartition(500)
          .mapPartitions(lineIterator => preprocessing(lineIterator))
          .flatMap(line => {
            line.split("\\s+").toSeq
          })
          .map((_, 1)).reduceByKey(_ + _)
    
        val wordCountsList = wordCounts.collect()
    
        Sorting.stableSort(wordCountsList, (a : (String, Int), b : (String, Int)) => {
          a._2 > b._2
        })
    
    
        // Task 2: Train word vectors and use
        //         wordCountsList(wordsToKeepCount)._2 as min-count
        //         for words
    
        val wordSequence = file
          .repartition(500)
          .mapPartitions(lineIterator => preprocessing(lineIterator))
          .map(line => {
            line.split("\\s+").toSeq
          })
    
        val word2vec = new Word2Vec()
        val model = word2vec
          .setMinCount(wordCountsList(wordsToKeepCount)._2)
          .setNumPartitions(20)
          .fit(wordSequence)
    
        val outputFilePath = outputDirectory.getAbsolutePath + File.separator + inputFile.getName
        val f = new File(outputFilePath)
    
        if(f.exists()) {
          FileUtils.deleteDirectory(f)
        }
    
        model.save(sc, outputFilePath)
    
        println("All done.")    
        System.exit(0)
      }
    

So, is this actually doing what I want and something else is wrong, or is it really a problem with when model.save() gets called, because the driver does not wait for the "second task"?
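
My understanding is that Spark actions such as `collect()` and `Word2Vec.fit()` block the driver until the corresponding jobs have finished, so the statements after them cannot start earlier. A minimal sketch of what I mean (the object name and the `local[2]` master are only for illustration, not part of the program above):

    import org.apache.spark.{SparkConf, SparkContext}

    object BlockingActionsDemo {
      def main(args: Array[String]): Unit = {
        // Local master only for this illustration; the real job runs on YARN.
        val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))

        val counts = sc.parallelize(Seq("a b", "b c"))
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKey(_ + _)

        // collect() is an action: it blocks until the whole job is done,
        // so this println necessarily runs after the counting has finished.
        val result = counts.collect()
        println(s"counting finished, ${result.length} distinct words")

        sc.stop()
      }
    }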

Logs:

    Log Type: directory.info
    Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
    Log Length: 5142
    Showing 4096 bytes of 5142 total.
    spark__.jar
    35028667369    4 drwx------   2 sfalk    cluster      4096 Apr  6 18:03 ./__spark_conf__
    35028667370    8 -r-x------   1 sfalk    cluster      7154 Apr  6 18:03 ./__spark_conf__/mapred-site.xml
    35028667371    8 -r-x------   1 sfalk    cluster      5342 Apr  6 18:03 ./__spark_conf__/hadoop-env.sh
    35028667372    4 -r-x------   1 sfalk    cluster       620 Apr  6 18:03 ./__spark_conf__/log4j.properties
    35028668000    4 -r-x------   1 sfalk    cluster      1994 Apr  6 18:03 ./__spark_conf__/hadoop-metrics2.properties
    35028668001   20 -r-x------   1 sfalk    cluster     19859 Apr  6 18:03 ./__spark_conf__/yarn-site.xml
    35028668002    4 -r-x------   1 sfalk    cluster      3979 Apr  6 18:03 ./__spark_conf__/hadoop-env.cmd
    35028668003    0 -r-x------   1 sfalk    cluster         0 Apr  6 18:03 ./__spark_conf__/yarn.exclude
    35028668004    8 -r-x------   1 sfalk    cluster      5358 Apr  6 18:03 ./__spark_conf__/core-site.xml
    35028668005    4 -r-x------   1 sfalk    cluster      1631 Apr  6 18:03 ./__spark_conf__/kms-log4j.properties
    35028668006    4 -r-x------   1 sfalk    cluster      2250 Apr  6 18:03 ./__spark_conf__/yarn-env.cmd
    35028668007    4 -r-x------   1 sfalk    cluster       884 Apr  6 18:03 ./__spark_conf__/ssl-client.xml
    35028668008    4 -r-x------   1 sfalk    cluster      2313 Apr  6 18:03 ./__spark_conf__/capacity-scheduler.xml
    35028668009    4 -r-x------   1 sfalk    cluster      3518 Apr  6 18:03 ./__spark_conf__/kms-acls.xml
    35028668010    4 -r-x------   1 sfalk    cluster      2358 Apr  6 18:03 ./__spark_conf__/topology_script.py
    35028668011    4 -r-x------   1 sfalk    cluster       758 Apr  6 18:03 ./__spark_conf__/mapred-site.xml.template
    35028668039    4 -r-x------   1 sfalk    cluster      1335 Apr  6 18:03 ./__spark_conf__/configuration.xsl
    35028668040    8 -r-x------   1 sfalk    cluster      5104 Apr  6 18:03 ./__spark_conf__/yarn-env.sh
    35028668041   12 -r-x------   1 sfalk    cluster      8398 Apr  6 18:03 ./__spark_conf__/hdfs-site.xml
    35028668042    4 -r-x------   1 sfalk    cluster      1020 Apr  6 18:03 ./__spark_conf__/commons-logging.properties
    35028668043    4 -r-x------   1 sfalk    cluster      1033 Apr  6 18:03 ./__spark_conf__/container-executor.cfg
    35028668044    8 -r-x------   1 sfalk    cluster      4221 Apr  6 18:03 ./__spark_conf__/task-log4j.properties
    35028668045    4 -r-x------   1 sfalk    cluster      2490 Apr  6 18:03 ./__spark_conf__/hadoop-metrics.properties
    35028668046    4 -r-x------   1 sfalk    cluster       856 Apr  6 18:03 ./__spark_conf__/mapred-env.sh
    35028668047    4 -r-x------   1 sfalk    cluster      1602 Apr  6 18:03 ./__spark_conf__/health_check
    35028668048    4 -r-x------   1 sfalk    cluster      2316 Apr  6 18:03 ./__spark_conf__/ssl-client.xml.example
    35028668061    4 -r-x------   1 sfalk    cluster      1527 Apr  6 18:03 ./__spark_conf__/kms-env.sh
    35028668062    4 -r-x------   1 sfalk    cluster      1308 Apr  6 18:03 ./__spark_conf__/hadoop-policy.xml
    35028668063    4 -r-x------   1 sfalk    cluster       436 Apr  6 18:03 ./__spark_conf__/slaves
    35028774647    4 -r-x------   1 sfalk    cluster      1084 Apr  6 18:03 ./__spark_conf__/topology_mappings.data
    35028774648    4 -r-x------   1 sfalk    cluster       951 Apr  6 18:03 ./__spark_conf__/mapred-env.cmd
    35028774651    4 -r-x------   1 sfalk    cluster      1000 Apr  6 18:03 ./__spark_conf__/ssl-server.xml
    35028774652    4 -r-x------   1 sfalk    cluster      2268 Apr  6 18:03 ./__spark_conf__/ssl-server.xml.example
    35028861113    4 -r-x------   1 sfalk    cluster       945 Apr  6 18:03 ./__spark_conf__/taskcontroller.cfg
    35028861114    8 -r-x------   1 sfalk    cluster      5511 Apr  6 18:03 ./__spark_conf__/kms-site.xml
    35028861115    8 -r-x------   1 sfalk    cluster      4113 Apr  6 18:03 ./__spark_conf__/mapred-queues.xml.template
    35028861116    4 -r-x------   1 sfalk    cluster      1052 Apr  6 18:03 ./__spark_conf__/__spark_conf__.properties
    42949879089 1249344 -r-x------   1 sfalk    cluster  1279326465 Apr  6 18:03 ./__app__.jar
    broken symlinks(find -L . -maxdepth 5 -type l -ls):
    
    Log Type: launch_container.sh
    Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
    Log Length: 5508
    Showing 4096 bytes of 5508 total.
    ger"
    export CLASSPATH="$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/2.4.0.0-169/hadoop/lib/hadoop-lzo-0.6.0.2.4.0.0-169.jar:/etc/hadoop/conf/secure"
    export SPARK_YARN_MODE="true"
    export SPARK_YARN_CACHE_FILES_VISIBILITIES="PUBLIC,PRIVATE"
    export HADOOP_TOKEN_FILE_LOCATION="/hadoop/hadoop/yarn/local/usercache/sfalk/appcache/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/container_tokens"
    export NM_AUX_SERVICE_spark_shuffle=""
    export SPARK_USER="sfalk"
    export HOME="/home/"
    export CONTAINER_ID="container_e05_1459928874908_0015_02_000001"
    export MALLOC_ARENA_MAX="4"
    ln -sf "/hadoop/hadoop/yarn/local/filecache/25/spark-hdp-assembly.jar" "__spark__.jar"
    hadoop_shell_errorcode=$?
    if [ $hadoop_shell_errorcode -ne 0 ]
    then
      exit $hadoop_shell_errorcode
    fi
    ln -sf "/hadoop/hadoop/yarn/local/usercache/sfalk/filecache/13/__spark_conf__8544006202028925038.zip" "__spark_conf__"
    hadoop_shell_errorcode=$?
    if [ $hadoop_shell_errorcode -ne 0 ]
    then
      exit $hadoop_shell_errorcode
    fi
    ln -sf "/hadoop/hadoop/yarn/local/usercache/sfalk/filecache/14/wordvectors-final.jar" "__app__.jar"
    hadoop_shell_errorcode=$?
    if [ $hadoop_shell_errorcode -ne 0 ]
    then
      exit $hadoop_shell_errorcode
    fi
    # Creating copy of launch script
    cp "launch_container.sh" "/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/launch_container.sh"
    chmod 640 "/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/launch_container.sh"
    # Determining directory contents
    echo "ls -l:" 1>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    ls -l 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    echo "find -L . -maxdepth 5 -ls:" 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    find -L . -maxdepth 5 -ls 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    find -L . -maxdepth 5 -type l -ls 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    exec /bin/bash -c "$JAVA_HOME/bin/java -server -Xmx65536m -Djava.io.tmpdir=$PWD/tmp -Dhdp.version=2.4.0.0-169 -Dspark.yarn.app.container.log.dir=/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'masterthesis.code.wordvectors.Word2VecOnCluster2' --jar file:/home/sfalk/./deploy/masterthesis/code/wordvectors/final/wordvectors-final.jar --arg '/datasets/amazonreviews/reviews_Electronics.json' --arg '/user/sfalk/amazonresults' --arg '100000' --executor-memory 40960m --executor-cores 8 --properties-file $PWD/__spark_conf__/__spark_conf__.properties 1> /hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/stdout 2> /hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/stderr"
    hadoop_shell_errorcode=$?
    if [ $hadoop_shell_errorcode -ne 0 ]
    then
      exit $hadoop_shell_errorcode
    fi
    
    Log Type: stderr
    Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
    Log Length: 924945
    Showing 4096 bytes of 924945 total.
    extHandler{/api,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
    16/04/06 19:01:38 INFO SparkUI: Stopped Spark web UI at http://192.168.0.109:37248
    16/04/06 19:01:38 INFO YarnClusterSchedulerBackend: Shutting down all executors
    16/04/06 19:01:38 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
    16/04/06 19:01:38 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
    (serviceOption=None,
     services=List(),
     started=false)
    16/04/06 19:01:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    16/04/06 19:01:38 INFO MemoryStore: MemoryStore cleared
    16/04/06 19:01:38 INFO BlockManager: BlockManager stopped
    16/04/06 19:01:38 INFO BlockManagerMaster: BlockManagerMaster stopped
    16/04/06 19:01:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    16/04/06 19:01:38 INFO SparkContext: Successfully stopped SparkContext
    16/04/06 19:01:38 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master01.hadoop.know-center.at:8020/user/sfalk/amazonresults/reviews_Electronics.json/metadata already exists)
    16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
    16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
    16/04/06 19:01:38 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
    16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
    16/04/06 19:01:38 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1459928874908_0015
    16/04/06 19:01:38 INFO ShutdownHookManager: Shutdown hook called
    16/04/06 19:01:38 INFO ShutdownHookManager: Deleting directory /hadoop/hadoop/yarn/local/usercache/sfalk/appcache/application_1459928874908_0015/spark-89fdf1b5-df2a-47e3-ae92-a98f4109c177
    
    Log Type: stdout
    Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
    Log Length: 32332
    Showing 4096 bytes of 32332 total.
    : 77482
          Word: information count: 77091
          Word: across count: 77077
          ...
          Word: arm count: 62987
          Word: ac count: 62777
      Target count: 24 to keep 100000 words.
    Keeping 100000 words.
    Saving model to /user/sfalk/amazonresults/reviews_Electronics.json
    Output file size: 50
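
Note that the stderr log above actually ends with `User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master01.hadoop.know-center.at:8020/user/sfalk/amazonresults/reviews_Electronics.json/metadata already exists`, so the program apparently did reach `model.save()`, and the save failed because the output path on HDFS still existed: the `java.io.File`/`FileUtils.deleteDirectory` check in the program only looks at the driver's local filesystem. A sketch of what deleting the old output through Hadoop's `FileSystem` API could look like instead (assuming the output lives on the cluster's default filesystem, as the exception suggests):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Delete the old output on the filesystem Spark actually writes to
    // (HDFS on the cluster), not on the driver's local disk.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val outputPath = new Path(outputFilePath)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true) // recursive delete
    }
    model.save(sc, outputPath.toString)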
    

0 Answers:

No answers