Question

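I am breaking a series of 90,000+ strings into a discrete list of the individual, non-duplicated words contained in those strings, each paired with the rxcui id value associated with its source string. I have developed a method that tries to accomplish this, but it produces a lot of redundancy. Analysis of the data shows that, after the contents of the strings have been cleaned and formatted, there are about 12,000 unique words in the 90,000+ source strings.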

How can I change the code below so that it avoids creating redundant rows in the target 2D ArrayList (as shown below)?

    public static ArrayList<ArrayList<String>> getAllWords(String[] tempsArray){//int count = tempsArray.length;
        int fieldslenlessthan2 = 0;//ArrayList<String> outputarr = new ArrayList<String>();
        ArrayList<ArrayList<String>> twoDimArrayList= new ArrayList<ArrayList<String>>();
        int idx = 0;
        for (String s : tempsArray) {
            String[] fields = s.split("\t");//System.out.println(" --- fields.length is: "+fields.length);
            if(fields.length>1){
                ArrayList<String> row = new ArrayList<String>();
                System.out.println("fields[0] is: "+fields[0]);
                String cleanedTerms = cleanTerms(fields[1]);
                String[] words = cleanedTerms.split(" ");
                for(int j=0;j<words.length;j++){
                    String word=words[j].trim();
                    word = word.toLowerCase();
                    if(isValidWord(word)){//outputarr.add(word);
                        System.out.println("words["+j+"] is: "+word);
                        row.add(word_id);//WORD_ID NEEDS TO BE CREATED BY SOME METHOD.
                        row.add(fields[0]);
                        row.add(word);
                        twoDimArrayList.add(row);
                        idx += 1;
                    }
                }
            }else{fieldslenlessthan2 += 1;}
        }
        System.out.println("........... fieldslenlessthan2 is: "+fieldslenlessthan2);
        return twoDimArrayList;
    }

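The output of the above method currently looks like the following, with some name values having many rxcui values and some rxcui values having many name values: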

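How can I change the code above so that the output is a unique list of name/rxcui value pairs that summarizes all the relevant data in the current output while removing only the redundancy?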

Answer 1

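If you just need a collection of all the words, use a HashSet. Sets are mainly used for containment logic. If you need to associate a value with each string, use a HashMap.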

public HashSet<String> getUniqueWords(String[] stringArray) {
  HashSet<String> uniqueWords = new HashSet<String>();
  for (String str : stringArray) {
    uniqueWords.add(str);
  }
  return uniqueWords;
}

This will give you a set of all the unique strings in the array. If you need an ID, use a HashMap.

String[] strList; // your String array
int idCounter = 0;
HashMap<String, Integer> stringIDMap = new HashMap<String, Integer>();

for (String str : strList) {
  if (!stringIDMap.containsKey(str)) { // HashMap has containsKey(), not contains()
    stringIDMap.put(str, idCounter); // autoboxed to Integer
    idCounter++;
  }
}

This will give you a HashMap with unique String keys and unique Integer values. To get the ID of a String, do the following:

stringIDMap.get("myString"); // returns the Integer ID associated with the String "myString"
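Note that adding each array element to a HashSet de-duplicates whole strings. To collect the individual words, as the question describes, each string would first need to be split. Here is a minimal sketch of that, assuming whitespace-separated words; the class and method names are hypothetical, and the question's own `cleanTerms` and `isValidWord` steps are left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class UniqueWordsSketch {
    // Splits each input string on whitespace and collects the distinct lowercase words.
    // Assumption: words are separated by whitespace; no further cleaning is applied.
    static Set<String> uniqueWordsIn(String[] lines) {
        Set<String> words = new LinkedHashSet<>(); // keeps first-seen order
        for (String line : lines) {
            for (String token : line.trim().toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token); // add() ignores words already present
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String[] sample = {"Aspirin Oral Tablet", "aspirin chewable tablet"};
        System.out.println(uniqueWordsIn(sample)); // prints [aspirin, oral, tablet, chewable]
    }
}
```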

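UPDATE: Based on the OP's question update, I would suggest creating an object that holds the String value and the rxcui. You can then put these objects in a Set or a HashMap, using an implementation similar to the ones provided above.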

public MyObject(String str, int rxcui); // The constructor for your new object
MyObject mo1 = new MyObject("hello", 5);

Either

mySet.add(mo1);

will work, or

myMap.put(mo1.getStr(), mo1.getRxcui());
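One caveat with the map variant: `put` replaces any value already stored under the same key, so a word that occurs under several rxcui values would keep only the last one. Since the question says some names have many rxcui values, one possible variation maps each word to a set of rxcui values. The sketch below assumes each input line has the form `rxcui<TAB>name`, as the question's `split("\t")` suggests; the class and method names are hypothetical, and the question's cleaning steps are omitted.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class WordToRxcuiSketch {
    // Groups every rxcui under each word found in its name field.
    // Assumption: lines look like "rxcui<TAB>name"; rxcui is kept as a String.
    static Map<String, Set<String>> groupByWord(String[] lines) {
        Map<String, Set<String>> byWord = new TreeMap<>(); // sorted by word
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // skip rows without a name field
            }
            String rxcui = fields[0];
            for (String token : fields[1].trim().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                // Create the set on first sight of the word, then add the rxcui;
                // the set silently drops repeated (word, rxcui) combinations.
                byWord.computeIfAbsent(token, k -> new TreeSet<>()).add(rxcui);
            }
        }
        return byWord;
    }

    public static void main(String[] args) {
        String[] sample = {"101\tAspirin Tablet", "202\taspirin capsule", "101\tAspirin Tablet"};
        // prints {aspirin=[101, 202], capsule=[202], tablet=[101]}
        System.out.println(groupByWord(sample));
    }
}
```

Each word then appears exactly once as a key, and each rxcui appears at most once per word.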

Answer 2

What is the purpose of the unique word ID? Isn't the word itself unique enough, given that you aren't keeping duplicates?

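A very basic approach would be to keep a counter running as you check for new words. For each word that does not exist yet, increment the counter and use the new value as that word's unique ID.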
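The counter idea could be sketched roughly as follows. This is only an illustration under assumptions: input lines are taken to be `rxcui<TAB>name`, rows are built as `[word_id, rxcui, word]` to mirror the question's layout, and all names are hypothetical. A fresh row is created for every word, and a set of rows drops exact repeats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class WordIdSketch {
    // Returns distinct [wordId, rxcui, word] rows, assigning IDs from a running counter.
    static List<List<String>> uniqueRows(String[] lines) {
        Map<String, Integer> wordIds = new HashMap<>();  // word -> assigned ID
        Set<List<String>> rows = new LinkedHashSet<>();  // lists compare by content
        int nextId = 0;
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // skip rows without a name field
            }
            for (String token : fields[1].trim().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                Integer id = wordIds.get(token);
                if (id == null) {          // first time this word is seen
                    id = nextId++;
                    wordIds.put(token, id);
                }
                // A new row per word; identical rows are rejected by the set.
                rows.add(Arrays.asList(String.valueOf(id), fields[0], token));
            }
        }
        return new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        String[] sample = {"101\tAspirin Tablet", "202\taspirin capsule", "101\tAspirin Tablet"};
        // prints [[0, 101, aspirin], [1, 101, tablet], [0, 202, aspirin], [2, 202, capsule]]
        System.out.println(uniqueRows(sample));
    }
}
```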

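Finally, I would suggest using a HashMap. It lets you insert and look up words in O(1) time on average. I'm not entirely sure what you are ultimately trying to do, but I think a HashMap would give you more flexibility.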

EDIT2: Something more along these lines. This should help you out.

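import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public static Set<DataPair> getAllWords(String[] tempsArray) {
    Set<DataPair> set = new HashSet<>();
    for (String row : tempsArray) {
        // PARSE YOUR STRING DATA
        // the way you were doing it seemed fine, but something like this
        // (adjust the delimiter to your data; the question splits on "\t")
        String[] rowArray = row.split(" ");
        String word = rowArray[1];
        int id = Integer.parseInt(rowArray[0]);
        DataPair pair = new DataPair(word, id);
        set.add(pair); // ignored if an equal pair is already present
    }
    return set;
}

class DataPair {
    private String word;
    private int id;

    public DataPair(String word, int id) {
        this.word = word;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof DataPair) {
            return ((DataPair) o).word.equals(word) && ((DataPair) o).id == id;
        }
        return false;
    }

    // HashSet looks elements up by hashCode() first, so it must agree with equals()
    @Override
    public int hashCode() {
        return Objects.hash(word, id);
    }
}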
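A point worth checking with this approach: a HashSet finds candidate elements through `hashCode()` before it ever calls `equals()`, so a pair class only de-duplicates reliably when both methods are overridden consistently. The following is a small self-contained check of that behavior, using a hypothetical `WordRxcui` class rather than the code above.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class WordRxcui {
    private final String word;
    private final int rxcui;

    WordRxcui(String word, int rxcui) {
        this.word = word;
        this.rxcui = rxcui;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof WordRxcui)) {
            return false;
        }
        WordRxcui other = (WordRxcui) o;
        return rxcui == other.rxcui && word.equals(other.word);
    }

    // Equal objects must produce equal hash codes, or HashSet may keep both.
    @Override
    public int hashCode() {
        return Objects.hash(word, rxcui);
    }

    public static void main(String[] args) {
        Set<WordRxcui> pairs = new HashSet<>();
        pairs.add(new WordRxcui("aspirin", 101));
        pairs.add(new WordRxcui("aspirin", 101)); // same word and rxcui: rejected
        pairs.add(new WordRxcui("aspirin", 202)); // same word, new rxcui: kept
        System.out.println(pairs.size()); // prints 2
    }
}
```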

Separating unique values in an algorithm
