Q工作不成功的mahout ssvd

时间:2012-07-09 15:19:46

标签: java mahout decomposition singular

我正在尝试在mahout中的某些tfidf-vectors上运行ssvd。当我在Java代码中运行它时如下(使用mahout 0.6罐),它工作正常:

public static void main(String[] args){
    runSSVDOnSparseVectors(vectorOutputPath
     + "/tfidf-vectors/part-r-00000", ssvdOutputPath, 1, 0, 30000, 1);
}

private static void runSSVDOnSparseVectors(String inputPath, String outputPath, 
                    int rank, int oversampling, int blocks,
                    int reduceTasks) throws IOException {
    Configuration conf = new Configuration();
    // get number of reduce tasks from config?
    SSVDSolver solver = new SSVDSolver(conf, new Path[] { new Path(
            inputPath) }, new Path(outputPath), blocks, rank, oversampling,
            reduceTasks);
    solver.setcUHalfSigma(true);
    solver.setcVHalfSigma(true);
    solver.run();
}

我决定将它转换为bash脚本而只是使用cli命令,但是当我这样做时,我得到以下错误(在版本0.5和0.7上试过这个,都没有用。我可以尝试0.6但是我不认为这是版本的事情):

[username@hostname lsa]$  $MAHOUT/mahout ssvd -i $H/test_lsa/v_out/tfidf-vectors -o $H/test_lsa/svd_out -k 1 -p 0 -r 30000 -t 1
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
12/07/23 15:00:47 INFO common.AbstractJob: Command line arguments: {--abtBlockHeight=[200000], --blockHeight=[30000], --broadcast=[true], --computeU=[true], --computeV=[true], --endPhase=[2147483647], --input=[/path/to/folder/test_lsa/v_out/tfidf-vectors], --minSplitSize=[-1], --outerProdBlockHeight=[30000], --output=[/path/to/folder/test_lsa/svd_out], --oversampling=[0], --pca=[false], --powerIter=[0], --rank=[1], --reduceTasks=[100], --startPhase=[0], --tempDir=[temp], --uHalfSigma=[false], --vHalfSigma=[false]}
12/07/23 15:00:49 INFO input.FileInputFormat: Total input paths to process : 100
Exception in thread "main" java.io.IOException: Q job unsuccessful.
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:230)
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:377)
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.run(SSVDCli.java:141)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.main(SSVDCli.java:171)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

我在群集上以分布式模式运行它。我已经读过Q作业失败可能与块大小有关,但是我的大于p + k。我也意识到我正在使用一个可笑的小输入(4个向量),但就像我说的,它适用于java代码。我很困惑为什么它可以在java中工作但不在CLI中工作。我很确定我已经获得了函数的所有相同参数。我总是可以将java代码打包成一个jar并将其放入bash脚本中,但这很糟糕......

工作日志说:

2012-07-23 15:00:55,413 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2012-07-23 15:00:55,417 INFO org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6ce53220
2012-07-23 15:00:55,638 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2012-07-23 15:00:55,697 ERROR org.apache.mahout.common.IOUtils: new m can't be less than n
java.lang.IllegalArgumentException: new m can't be less than n
    at     org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.cleanup(QRFirstStep.java:233)
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.close(QRFirstStep.java:89)
    at org.apache.mahout.common.IOUtils.close(IOUtils.java:128)
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.cleanup(QJob.java:158)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.    apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-07-23 15:00:55,731 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-07-23 15:00:55,733 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalArgumentException: new m can't be less than n
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.cleanup(QRFirstStep.java:233)
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.close(QRFirstStep.java:89)
    at org.apache.mahout.common.IOUtils.close(IOUtils.java:128)
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.cleanup(QJob.java:158)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-07-23 15:00:55,736 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

感谢您的帮助。

1 个答案:

答案 0 :(得分:0)

我实际上认为这是因为tfidf-vectors中有一些序列文件是空的,因为我使用了太多的reducer。这对我来说似乎是个错误。