I have a program that is supposed to do the following; the program below is my attempt at this task. However, from the log files I can see that
model.save(sc, outputFilePath)
seems to execute immediately after the counting task. It does not appear to actually wait for the second task to finish, and the result is an empty directory without a model.
def main(args: Array[String]) {
  val inputFile = new File(args(0))
  val outputDirectory = new File(args(1))
  outputDirectory.mkdirs()

  val wordsToKeepCount = args(2).toInt

  val conf = new SparkConf().setAppName("Word2VecOnCluster")
  val sc = new SparkContext(conf)

  val file = sc.textFile(inputFile.getAbsolutePath)

  // Task 1: Count the occurrence of words
  val wordCounts = file
    .repartition(500)
    .mapPartitions(lineIterator => preprocessing(lineIterator))
    .flatMap(line => line.split("\\s+").toSeq)
    .map((_, 1))
    .reduceByKey(_ + _)

  val wordCountsList = wordCounts.collect()
  Sorting.stableSort(wordCountsList, (a: (String, Int), b: (String, Int)) => a._2 > b._2)

  // Task 2: Train word vectors and use
  // wordCountsList(wordsToKeepCount)._2 as min-count for words
  val wordSequence = file
    .repartition(500)
    .mapPartitions(lineIterator => preprocessing(lineIterator))
    .map(line => line.split("\\s+").toSeq)

  val word2vec = new Word2Vec()
  val model = word2vec
    .setMinCount(wordCountsList(wordsToKeepCount)._2)
    .setNumPartitions(20)
    .fit(wordSequence)

  val outputFilePath = outputDirectory.getAbsolutePath + File.separator + inputFile.getName

  val f = new File(outputFilePath)
  if (f.exists()) {
    FileUtils.deleteDirectory(f)
  }

  model.save(sc, outputFilePath)

  println("All done.")
  System.exit(0)
}
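One detail worth pointing out: `new File(outputFilePath)` and `FileUtils.deleteDirectory(f)` operate on the driver's local filesystem, while `model.save(sc, ...)` writes through Hadoop's filesystem layer (HDFS when running on YARN), so a leftover output directory on HDFS is never actually removed. This matches the stderr log further down, which ends with a `FileAlreadyExistsException` for the `.../metadata` path. A minimal sketch of deleting the output path through Hadoop's `FileSystem` API instead, assuming the path should resolve against the cluster's default filesystem via `sc.hadoopConfiguration`:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the output path against the cluster's default filesystem
// (HDFS on YARN), not the driver's local disk.
val outputPath = new Path(outputFilePath)
val fs = FileSystem.get(sc.hadoopConfiguration)

// Recursively delete any leftover model directory before saving.
if (fs.exists(outputPath)) {
  fs.delete(outputPath, true) // true = recursive
}

model.save(sc, outputFilePath)
```

This is only a sketch of the filesystem handling; whether the local-vs-HDFS mismatch fully explains the empty directory depends on where the job ran.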
So, is this actually doing what I want and something else is wrong, or is it really a problem that model.save() is called too early because the driver does not wait for the "second task"?
Logs:
Log Type: directory.info
Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
Log Length: 5142
Showing 4096 bytes of 5142 total.
spark__.jar
35028667369 4 drwx------ 2 sfalk cluster 4096 Apr 6 18:03 ./__spark_conf__
35028667370 8 -r-x------ 1 sfalk cluster 7154 Apr 6 18:03 ./__spark_conf__/mapred-site.xml
35028667371 8 -r-x------ 1 sfalk cluster 5342 Apr 6 18:03 ./__spark_conf__/hadoop-env.sh
35028667372 4 -r-x------ 1 sfalk cluster 620 Apr 6 18:03 ./__spark_conf__/log4j.properties
35028668000 4 -r-x------ 1 sfalk cluster 1994 Apr 6 18:03 ./__spark_conf__/hadoop-metrics2.properties
35028668001 20 -r-x------ 1 sfalk cluster 19859 Apr 6 18:03 ./__spark_conf__/yarn-site.xml
35028668002 4 -r-x------ 1 sfalk cluster 3979 Apr 6 18:03 ./__spark_conf__/hadoop-env.cmd
35028668003 0 -r-x------ 1 sfalk cluster 0 Apr 6 18:03 ./__spark_conf__/yarn.exclude
35028668004 8 -r-x------ 1 sfalk cluster 5358 Apr 6 18:03 ./__spark_conf__/core-site.xml
35028668005 4 -r-x------ 1 sfalk cluster 1631 Apr 6 18:03 ./__spark_conf__/kms-log4j.properties
35028668006 4 -r-x------ 1 sfalk cluster 2250 Apr 6 18:03 ./__spark_conf__/yarn-env.cmd
35028668007 4 -r-x------ 1 sfalk cluster 884 Apr 6 18:03 ./__spark_conf__/ssl-client.xml
35028668008 4 -r-x------ 1 sfalk cluster 2313 Apr 6 18:03 ./__spark_conf__/capacity-scheduler.xml
35028668009 4 -r-x------ 1 sfalk cluster 3518 Apr 6 18:03 ./__spark_conf__/kms-acls.xml
35028668010 4 -r-x------ 1 sfalk cluster 2358 Apr 6 18:03 ./__spark_conf__/topology_script.py
35028668011 4 -r-x------ 1 sfalk cluster 758 Apr 6 18:03 ./__spark_conf__/mapred-site.xml.template
35028668039 4 -r-x------ 1 sfalk cluster 1335 Apr 6 18:03 ./__spark_conf__/configuration.xsl
35028668040 8 -r-x------ 1 sfalk cluster 5104 Apr 6 18:03 ./__spark_conf__/yarn-env.sh
35028668041 12 -r-x------ 1 sfalk cluster 8398 Apr 6 18:03 ./__spark_conf__/hdfs-site.xml
35028668042 4 -r-x------ 1 sfalk cluster 1020 Apr 6 18:03 ./__spark_conf__/commons-logging.properties
35028668043 4 -r-x------ 1 sfalk cluster 1033 Apr 6 18:03 ./__spark_conf__/container-executor.cfg
35028668044 8 -r-x------ 1 sfalk cluster 4221 Apr 6 18:03 ./__spark_conf__/task-log4j.properties
35028668045 4 -r-x------ 1 sfalk cluster 2490 Apr 6 18:03 ./__spark_conf__/hadoop-metrics.properties
35028668046 4 -r-x------ 1 sfalk cluster 856 Apr 6 18:03 ./__spark_conf__/mapred-env.sh
35028668047 4 -r-x------ 1 sfalk cluster 1602 Apr 6 18:03 ./__spark_conf__/health_check
35028668048 4 -r-x------ 1 sfalk cluster 2316 Apr 6 18:03 ./__spark_conf__/ssl-client.xml.example
35028668061 4 -r-x------ 1 sfalk cluster 1527 Apr 6 18:03 ./__spark_conf__/kms-env.sh
35028668062 4 -r-x------ 1 sfalk cluster 1308 Apr 6 18:03 ./__spark_conf__/hadoop-policy.xml
35028668063 4 -r-x------ 1 sfalk cluster 436 Apr 6 18:03 ./__spark_conf__/slaves
35028774647 4 -r-x------ 1 sfalk cluster 1084 Apr 6 18:03 ./__spark_conf__/topology_mappings.data
35028774648 4 -r-x------ 1 sfalk cluster 951 Apr 6 18:03 ./__spark_conf__/mapred-env.cmd
35028774651 4 -r-x------ 1 sfalk cluster 1000 Apr 6 18:03 ./__spark_conf__/ssl-server.xml
35028774652 4 -r-x------ 1 sfalk cluster 2268 Apr 6 18:03 ./__spark_conf__/ssl-server.xml.example
35028861113 4 -r-x------ 1 sfalk cluster 945 Apr 6 18:03 ./__spark_conf__/taskcontroller.cfg
35028861114 8 -r-x------ 1 sfalk cluster 5511 Apr 6 18:03 ./__spark_conf__/kms-site.xml
35028861115 8 -r-x------ 1 sfalk cluster 4113 Apr 6 18:03 ./__spark_conf__/mapred-queues.xml.template
35028861116 4 -r-x------ 1 sfalk cluster 1052 Apr 6 18:03 ./__spark_conf__/__spark_conf__.properties
42949879089 1249344 -r-x------ 1 sfalk cluster 1279326465 Apr 6 18:03 ./__app__.jar
broken symlinks(find -L . -maxdepth 5 -type l -ls):
Log Type: launch_container.sh
Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
Log Length: 5508
Showing 4096 bytes of 5508 total.
ger"
export CLASSPATH="$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/2.4.0.0-169/hadoop/lib/hadoop-lzo-0.6.0.2.4.0.0-169.jar:/etc/hadoop/conf/secure"
export SPARK_YARN_MODE="true"
export SPARK_YARN_CACHE_FILES_VISIBILITIES="PUBLIC,PRIVATE"
export HADOOP_TOKEN_FILE_LOCATION="/hadoop/hadoop/yarn/local/usercache/sfalk/appcache/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/container_tokens"
export NM_AUX_SERVICE_spark_shuffle=""
export SPARK_USER="sfalk"
export HOME="/home/"
export CONTAINER_ID="container_e05_1459928874908_0015_02_000001"
export MALLOC_ARENA_MAX="4"
ln -sf "/hadoop/hadoop/yarn/local/filecache/25/spark-hdp-assembly.jar" "__spark__.jar"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
exit $hadoop_shell_errorcode
fi
ln -sf "/hadoop/hadoop/yarn/local/usercache/sfalk/filecache/13/__spark_conf__8544006202028925038.zip" "__spark_conf__"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
exit $hadoop_shell_errorcode
fi
ln -sf "/hadoop/hadoop/yarn/local/usercache/sfalk/filecache/14/wordvectors-final.jar" "__app__.jar"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
exit $hadoop_shell_errorcode
fi
# Creating copy of launch script
cp "launch_container.sh" "/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/launch_container.sh"
chmod 640 "/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/launch_container.sh"
# Determining directory contents
echo "ls -l:" 1>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
ls -l 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
echo "find -L . -maxdepth 5 -ls:" 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
find -L . -maxdepth 5 -ls 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/directory.info"
exec /bin/bash -c "$JAVA_HOME/bin/java -server -Xmx65536m -Djava.io.tmpdir=$PWD/tmp -Dhdp.version=2.4.0.0-169 -Dspark.yarn.app.container.log.dir=/hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'masterthesis.code.wordvectors.Word2VecOnCluster2' --jar file:/home/sfalk/./deploy/masterthesis/code/wordvectors/final/wordvectors-final.jar --arg '/datasets/amazonreviews/reviews_Electronics.json' --arg '/user/sfalk/amazonresults' --arg '100000' --executor-memory 40960m --executor-cores 8 --properties-file $PWD/__spark_conf__/__spark_conf__.properties 1> /hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/stdout 2> /hadoop/hadoop/yarn/log/application_1459928874908_0015/container_e05_1459928874908_0015_02_000001/stderr"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
exit $hadoop_shell_errorcode
fi
Log Type: stderr
Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
Log Length: 924945
Showing 4096 bytes of 924945 total.
extHandler{/api,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/04/06 19:01:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/04/06 19:01:38 INFO SparkUI: Stopped Spark web UI at http://192.168.0.109:37248
16/04/06 19:01:38 INFO YarnClusterSchedulerBackend: Shutting down all executors
16/04/06 19:01:38 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
16/04/06 19:01:38 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
16/04/06 19:01:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/04/06 19:01:38 INFO MemoryStore: MemoryStore cleared
16/04/06 19:01:38 INFO BlockManager: BlockManager stopped
16/04/06 19:01:38 INFO BlockManagerMaster: BlockManagerMaster stopped
16/04/06 19:01:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/04/06 19:01:38 INFO SparkContext: Successfully stopped SparkContext
16/04/06 19:01:38 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master01.hadoop.know-center.at:8020/user/sfalk/amazonresults/reviews_Electronics.json/metadata already exists)
16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/04/06 19:01:38 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
16/04/06 19:01:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/04/06 19:01:38 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1459928874908_0015
16/04/06 19:01:38 INFO ShutdownHookManager: Shutdown hook called
16/04/06 19:01:38 INFO ShutdownHookManager: Deleting directory /hadoop/hadoop/yarn/local/usercache/sfalk/appcache/application_1459928874908_0015/spark-89fdf1b5-df2a-47e3-ae92-a98f4109c177
Log Type: stdout
Log Upload Time: Wed Apr 06 19:01:40 +0200 2016
Log Length: 32332
Showing 4096 bytes of 32332 total.
: 77482
Word: information count: 77091
Word: across count: 77077
...
Word: arm count: 62987
Word: ac count: 62777
Target count: 24 to keep 100000 words.
Keeping 100000 words.
Saving model to /user/sfalk/amazonresults/reviews_Electronics.json
Output file size: 50