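I am trying to create an .arff writer in Java. The code below is what I am using. The dataset contains categorical values, which are marked with * in the header.
The header is written separately from the data section. First, I extract the column IDs of all categorical columns. I now need help extracting the unique values of each categorical column.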
public void datatoARFF(String nameARFF, int sizeFeatureSubset, datasetRead ob, String featureNames, int[] subsets) {
    // Creates the .arff file from the data stored for the feature subset.
    PrintWriter writer = null;
    try {
        double[][] secsData = ob.getSecSet();
        ArrayList<String> currentCatSubs = new ArrayList<String>();
        // Create the new .arff file at the path given by nameARFF
        File file = new File(nameARFF);
        writer = new PrintWriter(file);
        // Writes the header
        writer.println("@relation '" + nameARFF + "'");
        // Load the feature names into an array.
        featureNames = featureNames.replaceAll(" ", "");
        featureNames = featureNames.replaceAll(".*\\{|\\}.*", "");
        String[] featureNamesArr = featureNames.split(",", -1);
        // Identifies the categorical column ids as currentCatSubs
        for (int featureR = 0; featureR < sizeFeatureSubset; featureR++) {
            if (featCat(featureNamesArr[featureR], '*')) {
                // add to arraylist if categorical
                currentCatSubs.add(Integer.toString(featureR));
            }
        }
        // Also treat column sizeFeatureSubset (presumably the class column) as categorical
        currentCatSubs.add(Integer.toString(sizeFeatureSubset));
        // ******** NEED HELP HERE TO NOW EXTRACT ALL DISTINCT VALUES FROM THE COLUMNS IDENTIFIED IN CURRENTCATSUBS AND THEN WRITE THEM BELOW. ********
        // Writes header with categorical values
        for (int feats = 0; feats < sizeFeatureSubset; feats++) {
            if (featCat(featureNamesArr[feats], '*')) {
                // categorical: the placeholder should become the list of unique values
                writer.println("@attribute '" + featureNamesArr[feats] + "' UNIQUE CATEGORIES FOR FEATURE");
            } else {
                // numeric feature
                writer.println("@attribute '" + featureNamesArr[feats] + "' real");
            }
        }
        // Writes the class attribute and the data into the file
        writer.println("@attribute 'catClass' {0,1}");
        writer.println("@data");
        for (int row = 0; row < secsData.length; row++) {
            for (int col = 0; col < secsData[row].length; col++) {
                if (currentCatSubs.contains(Integer.toString(col))) {
                    // categorical cells are written as integers
                    writer.print((int) secsData[row][col]);
                } else {
                    writer.print(secsData[row][col]);
                }
                if (col < secsData[row].length - 1) writer.print(",");
            }
            writer.println();
        }
    } catch (FileNotFoundException ex) {
        Logger.getLogger(WekaFormatter.class.getName()).log(Level.SEVERE, null, ex);
    } finally {
        // writer is still null if the file could not be opened
        if (writer != null) {
            writer.close();
        }
    }
}

// Method to identify if a feature is categorical or not, based on * in its name.
public boolean featCat(String str, char chr) {
    return str.indexOf(chr) != -1;
}
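Answer 0 (score: 0)
If you want to preserve order and get unique values, you can use Java 8 or 9 streams. My suggestion is to make the contents of featureNamesArr unique with this line of code: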
List<String> featureNamesDistinct = Stream.of(featureNamesArr).distinct().collect(Collectors.toList());
This yields the distinct feature names.
Then iterate over the distinct list instead of featureNamesArr:
// Identifies the categorical values column ids as currentCatSubs
for (int featureR = 0; featureR < sizeFeatureSubset && featureR < featureNamesDistinct.size(); featureR++) {
    if (featCat(featureNamesDistinct.get(featureR), '*')) {
        // add to arraylist if categorical
        currentCatSubs.add(Integer.toString(featureR));
    }
}
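Note that distinct() above de-duplicates the feature names, not the values stored in a column. If the goal is the distinct values inside one categorical column, the same stream idea might be applied to the data itself. This is only a minimal sketch: it assumes the data is a double[][] like the one returned by ob.getSecSet() in the question and that categorical cells hold whole numbers (the question casts them to int when writing). The names DistinctColumnValues and distinctInColumn are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class DistinctColumnValues {
    // Hypothetical helper: returns the distinct values of one column in
    // first-seen order, cast to int the same way the question's data loop does.
    static List<Integer> distinctInColumn(double[][] data, int col) {
        return Arrays.stream(data)
                .map(row -> (int) row[col])   // pick the cell of this column in each row
                .distinct()                   // keeps encounter order on an ordered stream
                .collect(Collectors.toList());
    }
}
```

Calling distinctInColumn(secsData, i) for each index stored in currentCatSubs would then give one list of category values per categorical column.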
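Answer 1 (score: -1)
"First, I extract the column IDs of all categorical columns. I now need help extracting the unique values of each categorical column."
Since the keyword here is unique values, you can use a HashSet. Reference: https://www.tutorialspoint.com/java/java_hashset_class.htm
If memory is not a concern, you can create one HashSet per categorical column and keep the sets in a map.
For example: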
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ExtractUniqueValue {
    public static void main(String[] args) {
        int[] test = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 9 }; // 9 appears twice
        ArrayList<String> currentCatSubs = new ArrayList<String>();
        Map<String, Set<String>> hashSets = new HashMap<String, Set<String>>();
        for (int i = 0; i < test.length; i++) {
            if (test[i] >= 5) {
                // for this demo, values of 5 and above play the role of "categorical"
                String value = Integer.toString(test[i]);
                currentCatSubs.add(value);
                // create the set for this key on first use
                if (hashSets.get(value) == null) {
                    hashSets.put(value, new HashSet<String>());
                }
                // a HashSet does not store repeated values, so only unique values remain
                hashSets.get(value).add(value);
            }
        }
        for (String key : hashSets.keySet()) {
            System.out.println("key=" + key);
            System.out.println("unique values=" + hashSets.get(key));
        }
    }
}
Output
key=5
unique values=[5]
key=6
unique values=[6]
key=7
unique values=[7]
key=8
unique values=[8]
key=9
unique values=[9] // 9 only appears once
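Applied to the question's setup, the set-per-column idea could look like the sketch below. It is a sketch under assumptions, not a definitive implementation: it assumes the data is a double[][] whose categorical cells hold whole numbers, it uses a TreeSet instead of a HashSet so that the values come out sorted, and the names NominalHeaderSketch, uniqueValues and attributeLine are hypothetical. The line it builds follows the ARFF convention for nominal attributes, a comma-separated list of values in braces.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class NominalHeaderSketch {
    // Hypothetical helper: for each listed column index, collects the
    // sorted set of distinct values found in that column (cast to int).
    static Map<Integer, Set<Integer>> uniqueValues(double[][] data, List<Integer> catCols) {
        Map<Integer, Set<Integer>> result = new LinkedHashMap<>();
        for (int col : catCols) {
            result.put(col, new TreeSet<>());
        }
        for (double[] row : data) {
            for (int col : catCols) {
                result.get(col).add((int) row[col]); // a set silently ignores repeats
            }
        }
        return result;
    }

    // Hypothetical helper: formats one nominal attribute line,
    // e.g. @attribute 'colour*' {1,2,3}
    static String attributeLine(String name, Set<Integer> values) {
        String joined = values.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        return "@attribute '" + name + "' {" + joined + "}";
    }
}
```

In the question's header loop, the result of attributeLine(featureNamesArr[feats], unique.get(feats)) could be printed for categorical features in place of the "UNIQUE CATEGORIES FOR FEATURE" placeholder, where unique is the map returned by uniqueValues.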