我正在尝试运行mahout的Lanczos示例。 我无法找到输入文件。什么应该是输入文件的格式。
我已经使用命令通过运行以下命令将.txt文件转换为序列文件格式:
bin/mahout seqdirectory -i input.txt -o outseq -c UTF-8
bin/mahout seq2sparse -i outseq -o ttseq
bin/hadoop jar mahout-examples-0.9-SNAPSHOT-job.jar org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver --input /user/hduser/outputseq --output /out1 --numCols 2 --numRows 4 --cleansvd "true" --rank 5
14/03/20 13:36:12 INFO lanczos.LanczosSolver: Finding 5 singular vectors of matrix with 4 rows, via Lanczos
14/03/20 13:36:13 INFO mapred.FileInputFormat: Total input paths to process : 7
Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:245)
at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:152)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:111)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:51)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:237)
... 13 more
请问好吗?
答案 0 :(得分:0)
在您的情况下,您正在执行 input.txt - > outseq - >的 ttseq 强>
您使用 outputseq (但不是 outseq )作为输入来生成 out1 。
您在 ttseq 时遇到错误。那很奇怪?也许你错过了帖子中的一些步骤。
对我来说:
PASSES :文本文件 - > output-seqdir - >的输出seq2sparse-归强>
FAILS :文字档案 - > output-seqdir - > output-seq2sparse - >的输出seq2sparse-归强>
更多细节。
我在不同的情况下看到了这个错误:
创建序列文件
$ mahout seqdirectory -i /data/lda/text-files/ -o /data/lda/output-seqdir -c UTF-8
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 20:47:25 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/data/lda/ohsumed_full_txt/ohsumed_full_txt/], --keyPrefix=[], --output=[/data/lda/output], --startPhase=[0], --tempDir=[temp]}
14/03/24 20:57:20 INFO driver.MahoutDriver: Program took 594764 ms (Minutes: 9.912733333333334)
将序列文件转换为稀疏矢量。默认情况下使用TFIDF。
$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse/ -ow
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:00:08 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
14/03/24 21:00:10 INFO input.FileInputFormat: Total input paths to process : 1
14/03/24 21:00:11 INFO mapred.JobClient: Running job: job_201403241418_0001
.....
14/03/24 21:02:51 INFO driver.MahoutDriver: Program took 162906 ms (Minutes: 2.7151)
以下命令失败(使用/data/lda/output-seq2sparse
作为输入)
$ mahout seq2sparse -i /data/lda/output-seq2sparse -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 -nr 5
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/data/lda/output-seq2sparse/df-count/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
....SKIPPED....
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
然而,这很好(使用/data/lda/output-seqdir
作为输入)
$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 -nr 5
Running on hadoop, using .../hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ..../mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:35:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 2
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 50.0
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 5
14/03/24 21:35:57 INFO input.FileInputFormat: Total input paths to process : 1
...SKIPPED...
14/03/24 21:45:11 INFO common.HadoopUtil: Deleting /data/lda/output-seq2sparse-normalized/partial-vectors-0
14/03/24 21:45:11 INFO driver.MahoutDriver: Program took 556420 ms (Minutes: 9.273666666666667)