Question

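The code below detects abbreviations and their long forms. It loops over each line of the document, then over each word in that line, and identifies acronym candidates. It then goes through every line of the document again to find a suitable long form for the abbreviation. My problem is that when an acronym occurs more than once in the document, my output contains multiple instances of it. I want each acronym printed only once, together with all of its possible long forms. Here is my code: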

public static void main(String[] args) throws FileNotFoundException
    {
        BufferedReader in = new BufferedReader(new FileReader("D:\\Workspace\\resource\\SampleSentences.txt"));
        String str=null;
        ArrayList<String> lines = new ArrayList<String>();
        String matchingLongForm;
        List <String> matchingLongForms = new ArrayList<String>() ;
        List <String> shortForm = new ArrayList<String>() ;
        Map<String, List<String>> abbreviationPairs = new HashMap<String, List<String>>();


        try
        {
            while((str = in.readLine()) != null){
                lines.add(str);
            }
        }
        catch (IOException e)
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        String[] linesArray = lines.toArray(new String[lines.size()]);




        // document wide search for abbreviation long form and identifying several appropriate matches
        for (String line : linesArray){
            for (String word : (Tokenizer.getTokenizer().tokenize(line))){
                if (isValidShortForm(word)){
                    for (int i = 0; i < linesArray.length; i++){
                        matchingLongForm = extractBestLongForm(word, linesArray[i]);
                        //shortForm.add(word);
                        if (matchingLongForm != null && !(matchingLongForms.contains(matchingLongForm))){
                            matchingLongForms.add(matchingLongForm);

                            //System.out.println(matchingLongForm);
                            abbreviationPairs.put(word, matchingLongForms);
                            //matchingLongForms.clear();
                        }
                    } 

                    if (abbreviationPairs != null){
                        //for(abbreviationPairs.)
                        System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs);
                        abbreviationPairs.clear();
                        matchingLongForms.clear();
                        //System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairsNew);
                    }


                    else
                        continue;
                }
            }
        }
    }

Here is the current output:

Abbreviation Pair:  {GLBA=[Gramm Leach Bliley act]} 
Abbreviation Pair:  {NCUA=[National credit union administration]} 
Abbreviation Pair:  {FFIEC=[Federal Financial Institutions Examination Council]}
Abbreviation Pair:  {CFR=[comments for the Report]} 
Abbreviation Pair:  {CFR=[comments for the Report]} 
Abbreviation Pair:  {CFR=[comments for the Report]} 
Abbreviation Pair:  {CFR=[comments for the Report]} 
Abbreviation Pair:  {OFAC=[Office of Foreign Assets Control]}

Answer 1

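Try using java.util.Set to store the matching short forms and long forms. From the class's javadoc:

... If this set already contains the element, the call leaves the set unchanged and returns false. In combination with the restriction on constructors, this ensures that sets never contain duplicate elements. ...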
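
As a rough illustration of the quoted behavior, here is a minimal sketch that deduplicates long-form candidates with a LinkedHashSet. The class name LongFormSetDemo, the method uniqueLongForms, and the second long form in main are made up for this example and are not from the question.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LongFormSetDemo {
    // Collects candidates into a LinkedHashSet, which drops duplicates
    // and keeps the order in which entries were first seen.
    static Set<String> uniqueLongForms(List<String> candidates) {
        Set<String> unique = new LinkedHashSet<String>();
        for (String candidate : candidates) {
            if (candidate != null) {
                unique.add(candidate); // returns false if already present
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
                "comments for the Report",
                "comments for the Report",
                "Code of Federal Regulations"); // hypothetical second long form
        System.out.println(uniqueLongForms(found));
    }
}
```

With a Set there is no need for a manual contains() check before adding; a plain HashSet would also work if the order of the long forms does not matter.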

Answer 2

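You need key-value pairs of abbreviation and text, so you should use a Map. A map cannot contain duplicate keys; each key can map to at most one value.

The problem is where you print the output, not the map. You print inside the loop, so the map is displayed multiple times.

Move this code outside the loop: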

if (abbreviationPairs != null){
     //for(abbreviationPairs.)
     System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs);
     abbreviationPairs.clear();
     matchingLongForms.clear();
     //System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairsNew);
}
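
To make the effect of that move concrete, here is a small sketch that fills the map inside the loop and prints it once afterwards. The class PrintOnceDemo and the matches array are stand-ins invented for this example; in the question the pairs would come from isValidShortForm and extractBestLongForm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PrintOnceDemo {
    // Hypothetical input: each row is {shortForm, longForm}.
    static Map<String, List<String>> collect(String[][] matches) {
        Map<String, List<String>> pairs = new LinkedHashMap<String, List<String>>();
        for (String[] match : matches) {
            List<String> longForms = pairs.get(match[0]);
            if (longForms == null) {
                longForms = new ArrayList<String>(); // one list per key
                pairs.put(match[0], longForms);
            }
            if (!longForms.contains(match[1])) {
                longForms.add(match[1]);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[][] matches = {
            {"CFR", "comments for the Report"},
            {"OFAC", "Office of Foreign Assets Control"},
            {"CFR", "comments for the Report"}, // repeated occurrence
        };
        // Print once, after the loop has finished.
        System.out.println("Abbreviation Pair:\t" + collect(matches));
    }
}
```

Because the map is neither printed nor cleared inside the loop, a repeated CFR occurrence updates the existing entry instead of producing another output line.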

Answer 3

Here is the solution.

Thanks to code_angel and Holger.

Move the printing code outside the loop, and create a new list for each matchingLongForm.

for (String line : linesArray){
        for (String word : (Tokenizer.getTokenizer().tokenize(line))){
            if (isValidShortForm(word)){
                for (int i = 0; i < linesArray.length; i++){
                    matchingLongForm = extractBestLongForm(word, linesArray[i]);
                    List <String> matchingLongForms = new ArrayList<String>() ;
                    if (matchingLongForm != null && !(matchingLongForms.contains(matchingLongForm))&& !(abbreviationPairs.containsKey(word))){
                        matchingLongForms.add(matchingLongForm);
                        //System.out.println(matchingLongForm);
                        abbreviationPairs.put(word, matchingLongForms);
                        //matchingLongForms.clear();
                    }
                } 

            }
        }
    }
    if (abbreviationPairs != null){
        System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs);
        //abbreviationPairs.clear();
        //matchingLongForms.clear();

    }

}

New output:

Abbreviation Pair:  {NCUA=[National credit union administration], FFIEC=[Federal Financial Institutions Examination Council], OFAC=[Office of Foreign Assets Control], MSSP=[Managed Security Service Providers], IS=[Information Systems], SLA=[Service level agreements], CFR=[comments for the Report], MIS=[Management Information Systems], IDS=[Intrusion detection systems], TSP=[Technology Service Providers], RFI=[risk that FIs], EIC=[Examples of in the cloud], TIER=[The institution should ensure], BCP=[Business continuity planning], GLBA=[Gramm Leach Bliley act], III=[It is important], FI=[Financial Institutions], RFP=[Request for proposal]}
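
Note that, as written, the containsKey check in the code above appears to keep only the first long form found for each abbreviation. If every distinct long form is wanted, one possible variation (a sketch, assuming Java 8 or later) is a Map<String, Set<String>> filled with computeIfAbsent. The class AllLongFormsDemo, the toyExtract helper, and the sample lines are hypothetical; toyExtract merely stands in for extractBestLongForm.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

class AllLongFormsDemo {
    // Gathers every distinct long form per short form across all lines.
    static Map<String, Set<String>> collect(List<String> shortForms, List<String> lines,
            BiFunction<String, String, String> extractor) {
        Map<String, Set<String>> pairs = new LinkedHashMap<>();
        for (String shortForm : shortForms) {
            for (String line : lines) {
                String longForm = extractor.apply(shortForm, line);
                if (longForm != null) {
                    // Creates the set on first use, then adds; duplicates are ignored.
                    pairs.computeIfAbsent(shortForm, k -> new LinkedHashSet<>()).add(longForm);
                }
            }
        }
        return pairs;
    }

    // Toy extractor (hypothetical): a line "SF=long form" yields the long form for SF.
    static String toyExtract(String shortForm, String line) {
        String prefix = shortForm + "=";
        return line.startsWith(prefix) ? line.substring(prefix.length()) : null;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "CFR=comments for the Report",
                "CFR=Code of Federal Regulations", // hypothetical second long form
                "CFR=comments for the Report");
        // The short form is listed twice to mimic repeated occurrences in a document.
        System.out.println("Abbreviation Pair:\t"
                + collect(Arrays.asList("CFR", "CFR"), lines, AllLongFormsDemo::toyExtract));
    }
}
```

Each abbreviation gets its own set, so no list is shared between keys, and repeated occurrences of an abbreviation add nothing new.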

Removing duplicate key-value pairs from a map whose values are lists