Question

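I am trying to write an .arff writer in Java. The code below is what I am using. The dataset contains categorical values, which are marked with * in the header.

The header is written separately from the data section. First, I extract the column IDs of all categorical columns. I now need help extracting the unique values of each categorical column.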

public void datatoARFF(String nameARFF, int sizeFeatureSubset, datasetRead ob, String featureNames, int[] subsets) {
    PrintWriter writer = null;
    try {
        double[][] secsData = ob.getSecSet();
        ArrayList<String> currentCatSubs = new ArrayList<String>();
        //this method will create the arff file from the datafile stored for the feature subset.
        // Create new .csv file and store in SUBMIT folder
        File file = new File(nameARFF);
        writer = new PrintWriter(file);

        // Writes the Header
        writer.print("@relation '" + nameARFF + "'");
        writer.println();

        // Load featurenames and subsets into arrays.
        featureNames = featureNames.replaceAll(" ", "");
        featureNames = featureNames.replaceAll(".*\\{|\\}.*", "");
        String[] featureNamesArr = featureNames.split(",", -1);

        // Identifies the categorical values column ids as currentCatSubs
        for (int featureR = 0; featureR < sizeFeatureSubset; featureR++) {
            if (featCat(featureNamesArr[featureR], '*') != true) {
                //do nothing
            } else {
                //add to arraylist if categorical
                currentCatSubs.add(Integer.toString(featureR));
            }
        }
        currentCatSubs.add(Integer.toString(sizeFeatureSubset));

        // ******** NEED HELP HERE TO NOW EXTRACT ALL DISTINCT VALUES FROM THE COLUMNS IDENTIFIED IN CURRENTCATSUBS AND THEN WRITE THEM BELOW. ********

        // Writes header with categorical values
        for (int feats = 0; feats < sizeFeatureSubset; feats++) {
            if (featCat(featureNamesArr[feats], '*')) {
                // categorical feature: the placeholder marks where its unique values belong
                writer.println("@attribute '" + featureNamesArr[feats] + "' UNIQUE CATEGORIES FOR FEATURE");
            } else {
                // numeric feature
                writer.println("@attribute '" + featureNamesArr[feats] + "' real");
            }
        }
        // Writes the data into file
        writer.println("@attribute 'catClass' {0,1}");
        writer.println("@data");

        for (int row = 0; row < secsData.length; row++) {

            for (int col = 0; col < secsData[row].length; col++) {

                if (currentCatSubs.contains(Integer.toString(col))) {

                    writer.print((int) secsData[row][col]);
                } else {

                    writer.print(secsData[row][col]);
                }

                if (col < secsData[row].length - 1) writer.print(",");
            }

            writer.println();
        }
    } catch (FileNotFoundException ex) {
        Logger.getLogger(WekaFormatter.class.getName()).log(Level.SEVERE, null, ex);
    } finally {
        // close exactly once, and only if the writer was actually opened
        if (writer != null) {
            writer.close();
        }
    }
}

//Method to identify if feature is categorical or not based on * in name.
public boolean featCat(String str, char chr) {
    return str.indexOf(chr) != -1;
}

Answer 1

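If you want to preserve the order and get unique values, you can also use Java 8 or 9 streams. My suggestion is to use this line in your code to make the contents of featureNamesArr unique: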

List<String> featureNamesDistinct = Stream.of(featureNamesArr).distinct().collect(Collectors.toList());

This gives you the distinct feature names, in their original order.

Then you need to iterate over the distinct list instead of featureNamesArr:

// Identifies the categorical values column ids as currentCatSubs
for (int featureR = 0; featureR < sizeFeatureSubset && featureR < featureNamesDistinct.size(); featureR++) {
    if (featCat(featureNamesDistinct.get(featureR), '*')) {
        //add to arraylist if categorical
        currentCatSubs.add(Integer.toString(featureR));
    }
}
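The snippet above de-duplicates the feature names. The same stream idea can probably be applied to the values of a single column of the 2D array. Here is a minimal sketch, assuming the data is a double[][] like secsData in the question and that categorical cells can be cast to int (as the question already does when printing them). The class name, method name and sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ColumnDistinct {

    // Returns the distinct values of one column, in order of first appearance.
    // Cells are cast to int, mirroring how the question prints categorical cells.
    public static List<Integer> distinctColumnValues(double[][] data, int col) {
        return Arrays.stream(data)
                .map(row -> (int) row[col])
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data: column 0 is categorical, column 1 is numeric.
        double[][] secsData = {
            {2, 0.5},
            {0, 1.5},
            {2, 2.5},
            {1, 3.5}
        };
        System.out.println(distinctColumnValues(secsData, 0)); // prints [2, 0, 1]
    }
}
```

You would call distinctColumnValues once for each column ID stored in currentCatSubs.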

Answer 2

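First, I extract the column IDs of all categorical columns. I now need help extracting the unique values of each categorical column.

Since you are asking for unique values, you can use a HashSet. Reference: https://www.tutorialspoint.com/java/java_hashset_class.htm

If memory is not a concern, you can create one HashSet per categorical column and keep the sets in a map.

For example: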

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExtractUniqueValue {

    public static void main(String[] args) {
        int[] test = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 9 }; // 9 appears twice
        List<String> currentCatSubs = new ArrayList<>();
        // one set of unique values per key (stands in for one set per categorical column)
        Map<String, Set<String>> hashSets = new HashMap<>();
        for (int i = 0; i < test.length; i++) {
            if (test[i] >= 5) { // stands in for the "is categorical" check
                String value = Integer.toString(test[i]);
                // add to the list if categorical
                currentCatSubs.add(value);
                // a HashSet does not store repeated values, so only unique values remain
                hashSets.computeIfAbsent(value, k -> new HashSet<>()).add(value);
            }
        }

        for (Map.Entry<String, Set<String>> entry : hashSets.entrySet()) {
            System.out.println("key=" + entry.getKey());
            System.out.println("unique values=" + entry.getValue());
        }
    }
}

Output

key=5
unique values=[5]
key=6
unique values=[6]
key=7
unique values=[7]
key=8
unique values=[8]
key=9
unique values=[9] // 9 only appears once
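
Applied to the question's data, the same idea could look roughly like the sketch below. It assumes the data is a double[][] and that the categorical column IDs are already known. TreeSet is used instead of HashSet only so that the values come out sorted. The class name, method names, attribute names and sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class NominalSpecs {

    // Collects the distinct values of every listed column in one pass over the rows.
    public static Map<Integer, Set<Integer>> uniqueValues(double[][] data, List<Integer> catCols) {
        Map<Integer, Set<Integer>> unique = new LinkedHashMap<>();
        for (int col : catCols) {
            unique.put(col, new TreeSet<>());
        }
        for (double[] row : data) {
            for (int col : catCols) {
                // a set silently ignores values it already contains
                unique.get(col).add((int) row[col]);
            }
        }
        return unique;
    }

    // Formats a set of values as an ARFF nominal specification, e.g. {0,1,2}.
    public static String toNominalSpec(Set<Integer> values) {
        return values.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        // Hypothetical sample data: columns 0 and 2 are categorical.
        double[][] secsData = {
            {2, 0.5, 1},
            {0, 1.5, 0},
            {2, 2.5, 1},
            {1, 3.5, 0}
        };
        Map<Integer, Set<Integer>> unique = uniqueValues(secsData, Arrays.asList(0, 2));
        for (Map.Entry<Integer, Set<Integer>> e : unique.entrySet()) {
            System.out.println("@attribute 'col" + e.getKey() + "' " + toNominalSpec(e.getValue()));
        }
        // prints:
        // @attribute 'col0' {0,1,2}
        // @attribute 'col2' {0,1}
    }
}
```

In the question's method, the string returned by toNominalSpec would take the place of the UNIQUE CATEGORIES FOR FEATURE placeholder in the header loop.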
