How to run DistributedLanczosSolver on Mahout

Time: 2014-03-20 17:37:47

Tags: mahout

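I am trying to run Mahout's Lanczos example, but I cannot find an input file for it. What format should the input file be in?

I converted a .txt file to sequence file format by running the following commands: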

bin/mahout seqdirectory -i input.txt -o outseq -c UTF-8
bin/mahout seq2sparse -i outseq -o ttseq

bin/hadoop jar mahout-examples-0.9-SNAPSHOT-job.jar org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver --input /user/hduser/outputseq --output /out1 --numCols 2 --numRows 4 --cleansvd "true" --rank 5

14/03/20 13:36:12 INFO lanczos.LanczosSolver: Finding 5 singular vectors of matrix with 4 rows, via Lanczos
14/03/20 13:36:13 INFO mapred.FileInputFormat: Total input paths to process : 7
Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
    at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:245)
    at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:152)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:111)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:51)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:237)
    ... 13 more

Could you please help?

1 answer:

Answer 0 (score: 0)

In your case, you are doing input.txt -> outseq -> ttseq.

You then use outputseq (not outseq) as the input to generate out1.

Yet the error refers to ttseq, which is strange. Maybe you left some steps out of your post.
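
The path in the exception can help settle which directory the job actually read. The sketch below is only an illustration: `input_root_from_error` is a hypothetical helper (it is not part of Mahout or Hadoop), and it assumes the message has the same shape as the one in the question, ending in `df-count/data`.

```shell
#!/bin/sh
# Hypothetical helper, not part of Mahout or Hadoop.
# Reads an exception line on stdin and prints the directory the job was
# reading when it looked for <dir>/df-count/data.
input_root_from_error() {
  sed -n 's|.*File does not exist: hdfs://[^/]*\(/.*\)/df-count/data.*|\1|p'
}

# Example with the message from the question:
echo 'java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data' \
  | input_root_from_error
```

For the message in the question this prints /user/hduser/ttseq, which suggests that the failing run was given ttseq as its input rather than the outputseq shown in the posted command.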


For me:

PASSES: text files -> output-seqdir -> output-seq2sparse-normalized

FAILS: text files -> output-seqdir -> output-seq2sparse -> output-seq2sparse-normalized

More details.

I saw this error in a different situation:

Create the sequence files

$ mahout seqdirectory -i /data/lda/text-files/ -o /data/lda/output-seqdir -c UTF-8
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 20:47:25 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/data/lda/ohsumed_full_txt/ohsumed_full_txt/], --keyPrefix=[], --output=[/data/lda/output], --startPhase=[0], --tempDir=[temp]}
14/03/24 20:57:20 INFO driver.MahoutDriver: Program took 594764 ms (Minutes: 9.912733333333334)

Convert the sequence files to sparse vectors. TFIDF is used by default.

$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse/ -ow
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:00:08 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
14/03/24 21:00:10 INFO input.FileInputFormat: Total input paths to process : 1
14/03/24 21:00:11 INFO mapred.JobClient: Running job: job_201403241418_0001
.....
14/03/24 21:02:51 INFO driver.MahoutDriver: Program took 162906 ms (Minutes: 2.7151)

The following command fails (using /data/lda/output-seq2sparse as input)

$ mahout seq2sparse -i /data/lda/output-seq2sparse -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2  -ml 50 -seq -n 2 -nr 5
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/data/lda/output-seq2sparse/df-count/data
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
    ....SKIPPED....
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

However, this works fine (using /data/lda/output-seqdir as input)

$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2  -ml 50 -seq -n 2 -nr 5
Running on hadoop, using .../hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ..../mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:35:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 2
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 50.0
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 5
14/03/24 21:35:57 INFO input.FileInputFormat: Total input paths to process : 1
...SKIPPED...
14/03/24 21:45:11 INFO common.HadoopUtil: Deleting /data/lda/output-seq2sparse-normalized/partial-vectors-0
14/03/24 21:45:11 INFO driver.MahoutDriver: Program took 556420 ms (Minutes: 9.273666666666667)
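
A possible explanation for the difference between the two runs, offered as an assumption rather than something confirmed in this thread: seq2sparse writes a directory containing subdirectories such as df-count, not a single sequence file, and Hadoop's SequenceFileInputFormat treats a subdirectory of its input as a MapFile and looks for a file named data inside it. That would produce the missing .../df-count/data path. A guard along these lines might catch the mistake before a job is submitted. `is_seq2sparse_output` is a hypothetical helper, and it assumes the listing has the path as the last field of each line, as `hadoop fs -ls` prints it.

```shell
#!/bin/sh
# Hypothetical helper, not part of Mahout or Hadoop.
# Reads a directory listing on stdin (path as the last whitespace-separated
# field of each line) and succeeds if any entry is named df-count.
is_seq2sparse_output() {
  awk '{ n = split($NF, parts, "/"); if (parts[n] == "df-count") found = 1 }
       END { exit(found ? 0 : 1) }'
}

# Usage sketch: runs only when a hadoop client is on the PATH.
# The path is the input of the failing command above.
if command -v hadoop >/dev/null 2>&1; then
  if hadoop fs -ls /data/lda/output-seq2sparse | is_seq2sparse_output; then
    echo "input looks like seq2sparse output; use the seqdirectory output instead" >&2
  fi
fi
```

The check works on the listing text rather than calling Hadoop inside the function, so it can be tried on a saved listing without a cluster.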