Is this program executed in the order I'm assuming?

Time: 2016-04-06 17:16:27

Tags: scala apache-spark

I have a program that is supposed to

  1. count how often each word occurs in a corpus, and
  2. take a count threshold and use it as the min-count input for Word2Vec training.

The program below is my attempt at this task. However, from the log file I can see that

    model.save(sc, outputFilePath)
    

    seems to be executed right after the counting task. It does not appear to actually wait for the second task to finish. The result is an empty directory without a model.

      import java.io.File

      import org.apache.commons.io.FileUtils
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.feature.Word2Vec

      import scala.util.Sorting

      def main(args: Array[String]) {
    
        val inputFile = new File(args(0))
        val outputDirectory = new File(args(1))
        outputDirectory.mkdirs()
        val wordsToKeepCount = args(2).toInt    
        val conf = new SparkConf().setAppName("Word2VecOnCluster")    
        val sc = new SparkContext(conf)
        val file = sc.textFile(inputFile.getAbsolutePath)
    
        // Task 1: Count the occurrence of words.
        // preprocessing() is a user-defined helper (defined elsewhere,
        // not shown here) that cleans the raw input lines.
    
        val wordCounts = file
          .repartition(500)
          .mapPartitions(lineIterator => preprocessing(lineIterator))
          .flatMap(line => {
            line.split("\\s+").toSeq
          })
          .map((_, 1)).reduceByKey(_ + _)
    
        val wordCountsList = wordCounts.collect()
    
        Sorting.stableSort(wordCountsList, (a : (String, Int), b : (String, Int)) => {
          a._2 > b._2
        })
    
    
        // Task 2: Train word vectors and use
        //         wordCountsList(wordsToKeepCount)._2 as min-count
        //         for words
    
        val wordSequence = file
          .repartition(500)
          .mapPartitions(lineIterator => preprocessing(lineIterator))
          .map(line => {
            line.split("\\s+").toSeq
          })
    
        val word2vec = new Word2Vec()
        val model = word2vec
          .setMinCount(wordCountsList(wordsToKeepCount)._2)
          .setNumPartitions(20)
          .fit(wordSequence)
    
        val outputFilePath = outputDirectory.getAbsolutePath + File.separator + inputFile.getName
        val f = new File(outputFilePath)
    
        if(f.exists()) {
          FileUtils.deleteDirectory(f)
        }
    
        model.save(sc, outputFilePath)
    
        println("All done.")    
        System.exit(0)
      }
    

So, is this actually doing what I want and something else is wrong, or is it really a problem with when model.save() gets called, because the driver does not wait for the "second task"?
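
My understanding is that Spark actions such as `collect()` and `Word2Vec.fit()` block the driver until the corresponding jobs have finished, so the statements after them cannot start earlier. A minimal sketch of what I mean (the object name and the `local[2]` master are only for illustration, not part of the program above):

    import org.apache.spark.{SparkConf, SparkContext}

    object BlockingActionsDemo {
      def main(args: Array[String]): Unit = {
        // Local master only for this illustration; the real job runs on YARN.
        val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))

        val counts = sc.parallelize(Seq("a b", "b c"))
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKey(_ + _)

        // collect() is an action: it blocks until the whole job is done,
        // so this println necessarily runs after the counting has finished.
        val result = counts.collect()
        println(s"counting finished, ${result.length} distinct words")

        sc.stop()
      }
    }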

Logs:

    Log Type: directory.info
    Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
    Log Length: 5142
    Showing 4096 bytes of 5142 total.
    spark__.jar
    35028667369    4 drwx------   2 sfalk    cluster      4096 Apr  6 18:03 ./__spark_conf__
    35028667370    8 -r-x------   1 sfalk    cluster      7154 Apr  6 18:03 ./__spark_conf__/mapred-site.xml
    35028667371    8 -r-x------   1 sfalk    cluster      5342 Apr  6 18:03 ./__spark_conf__/hadoop-env.sh
    35028667372    4 -r-x------   1 sfalk    cluster       620 Apr  6 18:03 ./__spark_conf__/log4j.properties
    35028668000    4 -r-x------   1 sfalk    cluster      1994 Apr  6 18:03 ./__spark_conf__/hadoop-metrics2.properties
    35028668001   20 -r-x------   1 sfalk    cluster     19859 Apr  6 18:03 ./__spark_conf__/yarn-site.xml
    35028668002    4 -r-x------   1 sfalk    cluster      3979 Apr  6 18:03 ./__spark_conf__/hadoop-env.cmd
    35028668003    0 -r-x------   1 sfalk    cluster         0 Apr  6 18:03 ./__spark_conf__/yarn.exclude
    35028668004    8 -r-x------   1 sfalk    cluster      5358 Apr  6 18:03 ./__spark_conf__/core-site.xml
    35028668005    4 -r-x------   1 sfalk    cluster      1631 Apr  6 18:03 ./__spark_conf__/kms-log4j.properties
    35028668006    4 -r-x------   1 sfalk    cluster      2250 Apr  6 18:03 ./__spark_conf__/yarn-env.cmd
    35028668007    4 -r-x------   1 sfalk    cluster       884 Apr  6 18:03 ./__spark_conf__/ssl-client.xml
    35028668008    4 -r-x------   1 sfalk    cluster      2313 Apr  6 18:03 ./__spark_conf__/capacity-scheduler.xml
    35028668009    4 -r-x------   1 sfalk    cluster      3518 Apr  6 18:03 ./__spark_conf__/kms-acls.xml
    35028668010    4 -r-x------   1 sfalk    cluster      2358 Apr  6 18:03 ./__spark_conf__/topology_script.py
    35028668011    4 -r-x------   1 sfalk    cluster       758 Apr  6 18:03 ./__spark_conf__/mapred-site.xml.template
    35028668039    4 -r-x------   1 sfalk    cluster      1335 Apr  6 18:03 ./__spark_conf__/configuration.xsl
    35028668040    8 -r-x------   1 sfalk    cluster      5104 Apr  6 18:03 ./__spark_conf__/yarn-env.sh
    35028668041   12 -r-x------   1 sfalk    cluster      8398 Apr  6 18:03 ./__spark_conf__/hdfs-site.xml
    35028668042    4 -r-x------   1 sfalk    cluster      1020 Apr  6 18:03 ./__spark_conf__/commons-logging.properties
    35028668043    4 -r-x------   1 sfalk    cluster      1033 Apr  6 18:03 ./__spark_conf__/container-executor.cfg
    35028668044    8 -r-x------   1 sfalk    cluster      4221 Apr  6 18:03 ./__spark_conf__/task-log4j.properties
    35028668045    4 -r-x------   1 sfalk    cluster      2490 Apr  6 18:03 ./__spark_conf__/hadoop-metrics.properties
    35028668046    4 -r-x------   1 sfalk    cluster       856 Apr  6 18:03 ./__spark_conf__/mapred-env.sh
    35028668047    4 -r-x------   1 sfalk    cluster      1602 Apr  6 18:03 ./__spark_conf__/health_check
    35028668048    4 -r-x------   1 sfalk    cluster      2316 Apr  6 18:03 ./__spark_conf__/ssl-client.xml.example
    35028668061    4 -r-x------   1 sfalk    cluster      1527 Apr  6 18:03 ./__spark_conf__/kms-env.sh
    35028668062    4 -r-x------   1 sfalk    cluster      1308 Apr  6 18:03 ./__spark_conf__/hadoop-policy.xml
    35028668063    4 -r-x------   1 sfalk    cluster       436 Apr  6 18:03 ./__spark_conf__/slaves
    35028774647    4 -r-x------   1 sfalk    cluster      1084 Apr  6 18:03 ./__spark_conf__/topology_mappings.data
    35028774648    4 -r-x------   1 sfalk    cluster       951 Apr  6 18:03 ./__spark_conf__/mapred-env.cmd
    35028774651    4 -r-x------   1 sfalk    cluster      1000 Apr  6 18:03 ./__spark_conf__/ssl-server.xml
    35028774652    4 -r-x------   1 sfalk    cluster      2268 Apr  6 18:03 ./__spark_conf__/ssl-server.xml.example
    35028861113    4 -r-x------   1 sfalk    cluster       945 Apr  6 18:03 ./__spark_conf__/taskcontroller.cfg
    35028861114    8 -r-x------   1 sfalk    cluster      5511 Apr  6 18:03 ./__spark_conf__/kms-site.xml
    35028861115    8 -r-x------   1 sfalk    cluster      4113 Apr  6 18:03 ./__spark_conf__/mapred-queues.xml.template
    35028861116    4 -r-x------   1 sfalk    cluster      1052 Apr  6 18:03 ./__spark_conf__/__spark_conf__.properties
    42949879089 1249344 -r-x------   1 sfalk    cluster  1279326465 Apr  6 18:03 ./__app__.jar
    broken symlinks(find -L . -maxdepth 5 -type l -ls):
    
    Log Type: launch_container.sh
    Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
    Log Length: 5508
    Showing 4096 bytes of 5508 total.
    ger"
    export CLASSPATH="$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/2.4.0.0-169/hadoop/lib/hadoop-lzo-0.6.0.2.4.0.0-169.jar:/etc/hadoop/conf/secure"
    export SPARK_YARN_MODE="true"
    export SPARK_YARN_CACHE_FILES_VISIBILITIES="PUBLIC,PRIVATE"
    export HADOOP_TOKEN_FILE_LOCATION="/hadoop/hadoop/yarn/local/usercache/sfalk/appcache/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/container_tokens"
    export NM_AUX_SERVICE_spark_shuffle=""
    export SPARK_USER="sfalk"
    export HOME="/home/"
    export CONTAINER_ID="container_e05_1459928874908_0015_02_000001"
    export MALLOC_ARENA_MAX="4"
    ln -sf "/hadoop/hadoop/yarn/local/filecache/25/spark-hdp-assembly.jar" "__spark__.jar"
    hadoop_shell_errorcode=$?
    if [ $hadoop_shell_errorcode -ne 0 ]
    then
      exit $hadoop_shell_errorcode
    fi
    ln -sf "/hadoop/hadoop/yarn/local/usercache/sfalk/filecache/13/__spark_conf__8544006202028925038.zip" "__spark_conf__"
    hadoop_shell_errorcode=$?
    if [ $hadoop_shell_errorcode -ne 0 ]
    then
      exit $hadoop_shell_errorcode
    fi
    ln -sf "/hadoop/hadoop/yarn/local/usercache/sfalk/filecache/14/wordvectors-final.jar" "__app__.jar"
    hadoop_shell_errorcode=$?
    if [ $hadoop_shell_errorcode -ne 0 ]
    then
      exit $hadoop_shell_errorcode
    fi
    # Creating copy of launch script
    cp "launch_container.sh" "/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/launch_container.sh"
    chmod 640 "/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/launch_container.sh"
    # Determining directory contents
    echo "ls -l:" 1>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    ls -l 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    echo "find -L . -maxdepth 5 -ls:" 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    find -L . -maxdepth 5 -ls 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    find -L . -maxdepth 5 -type l -ls 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
    exec /bin/bash -c "$JAVA_HOME/bin/java -server -Xmx65536m -Djava.io.tmpdir=$PWD/tmp -Dhdp.version=2.4.0.0-169 -Dspark.yarn.app.container.log.dir=/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'masterthesis.code.wordvectors.Word2VecOnCluster2' --jar file:/home/sfalk/./deploy/masterthesis/code/wordvectors/final/wordvectors-final.jar --arg '/datasets/amazonreviews/reviews_Electronics.json' --arg '/user/sfalk/amazonresults' --arg '100000' --executor-memory 40960m --executor-cores 8 --properties-file $PWD/__spark_conf__/__spark_conf__.properties 1> /hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/stdout 2> /hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/stderr"
    hadoop_shell_errorcode=$?
    if [ $hadoop_shell_errorcode -ne 0 ]
    then
      exit $hadoop_shell_errorcode
    fi
    
    Log Type: stderr
    Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
    Log Length: 924945
    Showing 4096 bytes of 924945 total.
    extHandler{/api,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
    16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
    16/04/06 19:01:38 INFO SparkUI: Stopped Spark web UI at http://192.168.0.109:37248
    16/04/06 19:01:38 INFO YarnClusterSchedulerBackend: Shutting down all executors
    16/04/06 19:01:38 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
    16/04/06 19:01:38 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
    (serviceOption=None,
     services=List(),
     started=false)
    16/04/06 19:01:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    16/04/06 19:01:38 INFO MemoryStore: MemoryStore cleared
    16/04/06 19:01:38 INFO BlockManager: BlockManager stopped
    16/04/06 19:01:38 INFO BlockManagerMaster: BlockManagerMaster stopped
    16/04/06 19:01:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    16/04/06 19:01:38 INFO SparkContext: Successfully stopped SparkContext
    16/04/06 19:01:38 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master01.hadoop.know-center.at:8020/user/sfalk/amazonresults/reviews_Electronics.json/metadata already exists)
    16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
    16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
    16/04/06 19:01:38 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
    16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
    16/04/06 19:01:38 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1459928874908_0015
    16/04/06 19:01:38 INFO ShutdownHookManager: Shutdown hook called
    16/04/06 19:01:38 INFO ShutdownHookManager: Deleting directory /hadoop/hadoop/yarn/local/usercache/sfalk/appcache/application_1459928874908_0015/spark-89fdf1b5-df2a-47e3-ae92-a98f4109c177
    
    Log Type: stdout
    Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
    Log Length: 32332
    Showing 4096 bytes of 32332 total.
    : 77482
          Word: information count: 77091
          Word: across count: 77077
          ...
          Word: arm count: 62987
          Word: ac count: 62777
      Target count: 24 to keep 100000 words.
    Keeping 100000 words.
    Saving model to /user/sfalk/amazonresults/reviews_Electronics.json
    Output file size: 50
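
Note that the stderr log above actually ends with `User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master01.hadoop.know-center.at:8020/user/sfalk/amazonresults/reviews_Electronics.json/metadata already exists`, so the program apparently did reach `model.save()`, and the save failed because the output path on HDFS still existed: the `java.io.File`/`FileUtils.deleteDirectory` check in the program only looks at the driver's local filesystem. A sketch of what deleting the old output through Hadoop's `FileSystem` API could look like instead (assuming the output lives on the cluster's default filesystem, as the exception suggests):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Delete the old output on the filesystem Spark actually writes to
    // (HDFS on the cluster), not on the driver's local disk.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val outputPath = new Path(outputFilePath)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true) // recursive delete
    }
    model.save(sc, outputPath.toString)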
    

0 Answers:

No answers