Question

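I'm trying to run a clustering job on Amazon EMR using Mahout. I have a Solr index that I uploaded to S3, and I want to vectorize it using Mahout's lucene.vector. (This is the first step in the job flow.)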

The parameters for the step are:

Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.driver.MahoutDriver
Args: lucene.vector --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors

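The error in the log is:

Unknown program 'lucene.vector' chosen.

I've run the same process locally with Hadoop and Mahout, and it worked fine. How should I call the lucene.vector function on EMR?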

Answer 1

The program name, lucene.vector, should come immediately after bin/mahout:

/homes/cuneyt/trunk/bin/mahout lucene.vector --dir /homes/cuneyt/lucene/index --field 0 --output lda/vector --dictOut /homes/cuneyt/lda/dict.txt

Answer 2

I eventually found the answer. The problem was that I was using the wrong MainClass argument. Instead of

org.apache.mahout.driver.MahoutDriver

I should have used:

org.apache.mahout.utils.vectors.lucene.Driver

So the correct parameters are:

Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.utils.vectors.lucene.Driver
Args: --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
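For comparison with a local run, the corrected step can be written out as a single command line. This is only a sketch under assumptions not stated above: that an EMR custom JAR step behaves roughly like `hadoop jar <jar> <MainClass> <args>`, and that the job jar has been copied from S3 to the current directory. The script only builds and prints the command; it does not run Hadoop.

```shell
#!/bin/sh
# Sketch: assemble the command line corresponding to the corrected step.
# Assumption: the job jar was copied from S3 to the working directory.
JAR="mahout-core-0.6-job.jar"

# The driver class is named directly, so MahoutDriver is not involved.
MAIN_CLASS="org.apache.mahout.utils.vectors.lucene.Driver"

# Arguments copied from the corrected step. There is no leading
# "lucene.vector" program name, because the main class already selects it.
ARGS="--dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors"

CMD="hadoop jar $JAR $MAIN_CLASS $ARGS"

# Print the command for inspection instead of executing it.
echo "$CMD"
```

The only differences from the failing step are the MainClass value and the removal of the lucene.vector token from the start of Args.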

Vectorizing a Solr index with Mahout using lucene.vector
