Question

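I am trying to get document clusters using Weka. The process is part of a bigger pipeline, and I really can't afford to write out arff files. I have all the documents and the bag of words for each document in a Map<String, Multiset<String>> structure, where the keys are document names and the Multiset<String> values are the bags of words in those documents. I have two questions: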

(1) My current approach ends up clustering terms, not documents:

public final Instances buildDocumentInstances(TreeMap<String, Multiset<String>> docToTermsMap, String encoding) throws IOException {
    int dimension = TermToDocumentFrequencyMap.navigableKeySet().size();
    FastVector attributes = new FastVector(dimension);
    for (String s : TermToDocumentFrequencyMap.navigableKeySet()) attributes.addElement(new Attribute(s));
    List<Instance> instances = Lists.newArrayList();
    for (Map.Entry<String, Multiset<String>> entry : docToTermsMap.entrySet()) {
        Instance instance = new Instance(dimension);
        for (Multiset.Entry<String> ms_entry : entry.getValue().entrySet()) {
            Integer index = TermToIndexMap.get(ms_entry.getElement());
            if (index != null)
                switch (encoding) {
                case "tf":
                    instance.setValue(index, ms_entry.getCount());
                    break;
                case "binary":
                    instance.setValue(index, ms_entry.getCount() > 0 ? 1 : 0);
                    break;
                case "tfidf":
                    double tf = ms_entry.getCount();
                    double df = TermToDocumentFrequencyMap.get(ms_entry.getElement());
                    double idf = Math.log(TermToIndexMap.size() / df);
                    instance.setValue(index, tf * idf);
                    break;
                }
        }
        instances.add(instance);
    }
    Instances dataset = new Instances("My Dataset Name", attributes, instances.size());
    for (Instance instance : instances) dataset.add(instance);
    return dataset;
}

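I am trying to create individual Instance objects and then build a dataset by adding them to an Instances object. Each instance is a document vector (with 0/1, tf, or tf-idf encoding), and each word is a separate attribute. But when I run SimpleKMeans#buildClusterer, the output shows that it is clustering the words, not the documents. I am clearly doing something horribly wrong, but I can't figure out what the mistake is.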
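To make the intended layout concrete, here is a minimal, Weka-free sketch of the matrix described above: one row per document and one column per term. The class DocTermMatrix and its methods are hypothetical names introduced only for illustration, and a plain Map<String, Integer> stands in for Guava's Multiset<String>. The sketch uses the number of documents as the idf numerator, which is the conventional definition; the listing above uses TermToIndexMap.size() instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper (not part of Weka): builds a documents-by-terms matrix. */
final class DocTermMatrix {

    /** Sorted vocabulary over all documents; a term's list position is its column index. */
    static List<String> vocabulary(Map<String, Map<String, Integer>> docToTerms) {
        TreeSet<String> terms = new TreeSet<>();
        for (Map<String, Integer> bag : docToTerms.values()) {
            terms.addAll(bag.keySet());
        }
        return new ArrayList<>(terms);
    }

    /** One row per document, one column per term; encoding is "tf", "binary" or "tfidf". */
    static double[][] build(Map<String, Map<String, Integer>> docToTerms, String encoding) {
        List<String> vocab = vocabulary(docToTerms);
        int numDocs = docToTerms.size();

        // Document frequency: in how many documents each term occurs.
        int[] df = new int[vocab.size()];
        for (Map<String, Integer> bag : docToTerms.values()) {
            for (int j = 0; j < vocab.size(); j++) {
                if (bag.getOrDefault(vocab.get(j), 0) > 0) df[j]++;
            }
        }

        double[][] rows = new double[numDocs][vocab.size()];
        int i = 0;
        for (Map<String, Integer> bag : docToTerms.values()) {
            for (int j = 0; j < vocab.size(); j++) {
                int tf = bag.getOrDefault(vocab.get(j), 0);
                if (tf == 0) continue; // absent terms stay at 0.0
                switch (encoding) {
                    case "tf":
                        rows[i][j] = tf;
                        break;
                    case "binary":
                        rows[i][j] = 1;
                        break;
                    case "tfidf":
                        // idf = log(N / df), with N the number of documents
                        rows[i][j] = tf * Math.log((double) numDocs / df[j]);
                        break;
                    default:
                        throw new IllegalArgumentException("unknown encoding: " + encoding);
                }
            }
            i++;
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> docs = new TreeMap<>();
        docs.put("a.txt", Map.of("weka", 2, "cluster", 1));
        docs.put("b.txt", Map.of("cluster", 3));
        double[][] m = build(docs, "tf");
        System.out.println(m.length + " documents x " + m[0].length + " terms");
    }
}
```

If the rows of this matrix correspond to documents, a clusterer run over the rows should group documents. Whether the Weka dataset built in the listing has the same orientation is the thing to check.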

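(2) How can I use StringToWordVector in this scenario? Everywhere I have looked, people suggest using weka.filters.unsupervised.attribute.StringToWordVector to cluster documents. However, I can't find any way to use it with the words I already have in my document -> bag-of-words structure. [Note: in my case it is a Map<String, Multiset<String>>, but that is not a strict requirement. I can transform it into some other data structure if StringToWordVector requires it.]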
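Since StringToWordVector works on string attributes, one possible route (an assumption on my part, not something confirmed here) is to flatten each bag of words back into a whitespace-separated string and store that string as a single text attribute per document. The sketch below shows only the flattening step in plain Java. BagToText is a hypothetical name, a Map<String, Integer> again stands in for Multiset<String>, and it assumes that no term contains whitespace or other tokenizer delimiters.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Hypothetical helper (not part of Weka): turns bags of words back into text. */
final class BagToText {

    /** Repeats each term as many times as it occurs, in sorted term order. */
    static String toText(Map<String, Integer> bag) {
        StringJoiner text = new StringJoiner(" ");
        for (Map.Entry<String, Integer> e : new TreeMap<>(bag).entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                text.add(e.getKey());
            }
        }
        return text.toString();
    }

    /** Maps each document name to its reconstructed text, sorted by document name. */
    static Map<String, String> toTexts(Map<String, Map<String, Integer>> docToTerms) {
        Map<String, String> texts = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : docToTerms.entrySet()) {
            texts.put(e.getKey(), toText(e.getValue()));
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(toText(Map.of("weka", 2, "cluster", 1)));
    }
}
```

The original word order is lost, which should not matter for a bag-of-words representation. Each reconstructed string would then become the value of a string attribute in an in-memory Instances object, so no arff file is needed.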

Using StringToWordVector in Weka with internal data structures

0 Answers: