Question

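I am using Mahout 0.9 (installed on HDP 2.2) for topic discovery (the Latent Dirichlet Allocation algorithm). My text files are stored in the directory inputraw, and I run the following commands in order.

Command #1: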

mahout seqdirectory -i inputraw -o output-directory -c UTF-8

Command #2:

mahout seq2sparse -i output-directory -o output-vector-str -wt tf -ng 3 --maxDFPercent 40 -ow -nv

Command #3:

mahout rowid -i output-vector-str/tf-vectors/ -o output-vector-int

Command #4:

mahout cvb -i output-vector-int/matrix -o output-topics -k 1 -mt output-tmp -x 10 -dict output-vector-str/dictionary.file-0

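After the second command runs, it creates, as expected, a set of subfolders and files under output-vector-str (named df-count, dictionary.file-0, frequency.file-0, tf-vectors, tokenized-documents and wordcount). Given the size of my input files, the sizes of these outputs all look reasonable, except that the file under `tf-vectors` is very small: it is only 118 bytes.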

Since `tf-vectors` is the input to the 3rd command, the third command obviously also generates a small file. Does anyone know:

What is the reason for the files under the `tf-vectors` folder to be that small? There must be something wrong.

Starting from the first command, all generated files have a strange encoding and are not human-readable. Is this expected?

Answer 1

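Here are the answers to your questions:

What is the reason for the files under the tf-vectors folder to be that small?

The vectors are small because you set maxDFPercent to only 40. This means that only terms with a document frequency (the percentage of documents in which the term appears) below 40% are considered. In other words, when the vectors are generated, only terms that appear in 40% or fewer of the documents are kept.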
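One way to test this explanation, as a rough sketch: rerun command #2 with a looser threshold and compare the size of the resulting tf-vectors. The value 90 and the output directory name are assumptions chosen for illustration; the remaining flags are copied from the question. The size check assumes the output lives in HDFS, as it typically would on HDP.

```shell
# Hypothetical check: rerun seq2sparse with a looser document-frequency cutoff.
# MAX_DF and OUT_DIR are illustrative values, not taken from the original post.
MAX_DF=90
OUT_DIR="output-vector-str-df${MAX_DF}"

if command -v mahout >/dev/null 2>&1; then
  # Same flags as command #2, except for the output directory and --maxDFPercent.
  mahout seq2sparse -i output-directory -o "$OUT_DIR" -wt tf -ng 3 \
    --maxDFPercent "$MAX_DF" -ow -nv
  # Print the size of the new vectors to compare with the original 118-byte file
  # (assumes the output was written to HDFS).
  hadoop fs -du "$OUT_DIR/tf-vectors"
else
  echo "mahout not found on PATH; skipping the rerun"
fi
```

If the file stays tiny even with a much higher cutoff, the threshold is probably not the only factor, and the earlier outputs (for example tokenized-documents) are worth inspecting as well.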

Starting from the first command, all generated files have a strange encoding and are not human-readable. Is this expected?

Mahout has a command called mahout seqdumper that can help you dump files from the "sequence" file format into a human-readable format. Good luck!!
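A minimal sketch of how seqdumper might be used here. The file name part-r-00000 is an assumption (list the tf-vectors directory to find the actual file name), and only the `-i` input option is used.

```shell
# Hypothetical path: part-r-00000 is an assumed file name, not taken from the post.
VECTOR_FILE="output-vector-str/tf-vectors/part-r-00000"

if command -v mahout >/dev/null 2>&1; then
  # Dump the sequence file as text and show only the first 20 lines.
  mahout seqdumper -i "$VECTOR_FILE" | head -n 20
else
  echo "mahout not found on PATH; skipping the dump"
fi
```

The same command can be pointed at the output of the other steps, such as dictionary.file-0, to check what each stage actually produced.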
