Question

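I need to iteratively extend a Weka ARFF file with SparseInstance objects. Each time a new SparseInstance is added, the header may change, because the new instance may introduce additional attributes. I thought the mergeInstances method would solve my problem, but it does not: it requires that the two datasets have no attributes in common.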

In case that is not entirely clear, consider the following example:

Dataset1
a b c
1 2 3
4 5 6

Dataset2
c d
7 8

Merged result:
a b c d
1 2 3 ?
4 5 6 ?
? ? 7 8

The only solution I can see at the moment is to parse the ARFF file by hand and merge it using String processing. Does anyone know a better solution?

Answer 1

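OK, I found a solution myself. The core of it is the method Instances#insertAttributeAt, which inserts the new attribute as the last attribute when the second argument is model.numAttributes(). Below is some example code for numeric attributes. It can easily be adapted to other attribute types: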

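    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Map;
    import org.apache.commons.io.IOUtils; // assumed source of closeQuietly
    import weka.core.Attribute;
    import weka.core.FastVector;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.SparseInstance;
    import weka.core.converters.ArffLoader.ArffReader;
    import weka.core.converters.ArffSaver;

    // Method body; targetFile, currentInstance and LOGGER come from the surrounding class.
    Map<String,String> currentInstanceFeatures = currentInstance.getFeatures();
    Instances model = null;
    try {
        if (targetFile.exists()) {
            // Load the existing dataset, including its current header.
            FileReader in = new FileReader(targetFile);
            try {
                BufferedReader reader = new BufferedReader(in);
                ArffReader arff = new ArffReader(reader);
                model = arff.getData();
            } finally {
                IOUtils.closeQuietly(in);
            }
        } else {
            // No file yet: start with an empty schema.
            FastVector schema = new FastVector();
            model = new Instances("model", schema, 1);
        }
        Instance newInstance = new SparseInstance(0);
        newInstance.setDataset(model);

        for (Map.Entry<String,String> feature : currentInstanceFeatures.entrySet()) {
            Attribute attribute = model.attribute(feature.getKey());
            if (attribute == null) {
                // Unknown attribute: append it as a numeric attribute at the end of the header.
                attribute = new Attribute(feature.getKey());
                model.insertAttributeAt(attribute, model.numAttributes());
                // Fetch the attribute again so that it carries its index in the dataset.
                attribute = model.attribute(feature.getKey());
            }
            // Numeric attributes take a double; the String overload of setValue
            // only accepts nominal or string attributes.
            newInstance.setValue(attribute, Double.parseDouble(feature.getValue()));
        }

        model.add(newInstance);
        model.compactify();
        ArffSaver saver = new ArffSaver();
        saver.setInstances(model);
        saver.setFile(targetFile);
        LOGGER.debug("Saving dataset to: " + targetFile.getAbsoluteFile());
        saver.writeBatch();
    } catch (IOException e) {
        throw new IllegalArgumentException(e);
    }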
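
For readers who only want to see the merge logic itself, here is a minimal plain-Java sketch of the same idea, independent of Weka: take the union of both headers, then pad every row with `?` for the attributes it does not have. The class `SchemaUnionMerge` and its methods are hypothetical names made up for this illustration, not part of Weka, and representing a dataset as lists of strings is an assumption chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical helper (not a Weka class): merges two tables over the union of their headers. */
final class SchemaUnionMerge {

    /** Returns the attributes of the first header followed by the new ones from the second. */
    static List<String> unionHeader(List<String> first, List<String> second) {
        LinkedHashSet<String> union = new LinkedHashSet<>(first);
        union.addAll(second); // attributes already present keep their original position
        return new ArrayList<>(union);
    }

    /** Re-orders one row to match targetHeader, writing "?" for attributes the row lacks. */
    static List<String> alignRow(List<String> sourceHeader, List<String> row,
                                 List<String> targetHeader) {
        List<String> aligned = new ArrayList<>();
        for (String name : targetHeader) {
            int index = sourceHeader.indexOf(name);
            aligned.add(index < 0 ? "?" : row.get(index));
        }
        return aligned;
    }

    /** Returns the merged table; element 0 is the merged header, the rest are data rows. */
    static List<List<String>> merge(List<String> headerA, List<List<String>> rowsA,
                                    List<String> headerB, List<List<String>> rowsB) {
        List<String> header = unionHeader(headerA, headerB);
        List<List<String>> merged = new ArrayList<>();
        merged.add(header);
        for (List<String> row : rowsA) {
            merged.add(alignRow(headerA, row, header));
        }
        for (List<String> row : rowsB) {
            merged.add(alignRow(headerB, row, header));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Dataset1 and Dataset2 from the question.
        List<List<String>> merged = merge(
                Arrays.asList("a", "b", "c"),
                Arrays.asList(Arrays.asList("1", "2", "3"), Arrays.asList("4", "5", "6")),
                Arrays.asList("c", "d"),
                Arrays.asList(Arrays.asList("7", "8")));
        for (List<String> line : merged) {
            System.out.println(String.join(" ", line));
        }
    }
}
```

Running `main` on the two datasets from the question prints the merged result shown there. The Weka code above follows the same pattern, with insertAttributeAt extending the header; in a sparse instance it is the unset attributes that play the role of the padding.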
