Determine whether a string is a proper noun in a text

Time: 2014-12-05 23:33:10

Tags: java string arraylist

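I'm trying to parse a text (http://pastebin.com/raw.php?i=0wD91r2i) and retrieve the words along with their occurrence counts. However, I must not include proper nouns in the final output. I'm not quite sure how to accomplish this.

My attempt at this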

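import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Scanner;

public class TextAnalysis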
{
    public static void main(String[] args)
    {
        ArrayList<Word> words = new ArrayList<Word>(); //instantiate array list of object Word
        try
        {
            int lineCount = 0; 
            int wordCount = 0;
            int specialWord = 0;
            URL reader = new URL("http://pastebin.com/raw.php?i=0wD91r2i");
            Scanner in = new Scanner(reader.openStream());
            while(in.hasNextLine()) //while to parse text
            {
                lineCount++;
                String textInfo[] = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+"); //use regex to replace all punctuation with empty char and split words with white space chars in between
                wordCount += textInfo.length; 
                for(int i=0; i<textInfo.length; i++)
                {
                    if(textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+")) //if word matches any special word case, add count of special words then continue to next word
                    {
                        specialWord++;
                        continue;
                    }
                    if(!textInfo[i].matches(".*\\w.*")) continue; //also if text matches white space then continue
                    boolean found = false;
                    for(Word word: words) //check whether word already exists in list -- if so add count
                    {
                        if(word.getWord().equals(textInfo[i]))
                        {
                            word.addOccurence(1);
                            word.addLine(lineCount);
                            found = true;
                        }
                    }
                    if(!found) //else add new entry
                    {
                        words.add(new Word(textInfo[i], lineCount, 1));
                    }
                }
            }
            //adds data from capital word to lowercase word ATTEMPT AT PROPER NOUNS HERE
            for(Word word: words)
            {
                for(int i=0; i<words.size(); i++)
                {
                    if(Character.isUpperCase(word.getWord().charAt(0)) && word.getWord().toLowerCase().equals(words.get(i).getWord()))
                    {
                        words.get(i).addOccurence(word.getOccurence());
                        words.get(i).addLine(word.getLine());
                    }
                }
            }

            Comparator<Word> occurenceComparator = new Comparator<Word>() //compares words by number of occurrences, highest first
            {
                public int compare(Word n1, Word n2)
                {
                    if(n1.getOccurence() < n2.getOccurence()) return 1;
                    else if (n1.getOccurence() == n2.getOccurence()) return 0;
                    else return -1;
                }
            };
            Collections.sort(words);
            // Collections.sort(words, occurenceComparator);
            // ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0,100));
            // Collections.sort(top_words);
            System.out.printf("%-15s%-15s%s\n", "Word", "Occurences", "Word Distribution Index");
            for(Word word: words)
            {
                word.setTotalLine(lineCount);
                System.out.println(word);
            }
            System.out.println(wordCount);
            System.out.printf("%s%.3f\n","The connecting word index is ",specialWord*100.0/wordCount);
        }
        catch(IOException ex)
        {
            System.out.println("WEB URL NOT FOUND");
        }
    }
}

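The formatting is off; I'm not sure how to do it properly.

The code checks whether a word is capitalized and, if the lowercase version of the word also exists, adds its data to the lowercase word. However, this doesn't account for words whose lowercase version never appears, such as "Four" or "Now" in the text. How could I do this without cross-referencing a dictionary?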
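To make the limitation concrete, here is a minimal sketch of that merge step using a plain `Map` of counts instead of the `Word` class. The class name `MergeSketch`, the method `mergeCapitalized`, and the sample tokens are made up for illustration. A capitalized entry is folded into its lowercase twin only when that twin exists, so sentence-initial words such as "Four" survive the merge alongside genuine proper nouns.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {
    // Counts the tokens, then folds each capitalized entry into its
    // lowercase twin when (and only when) that twin was also seen.
    static Map<String, Integer> mergeCapitalized(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            if (token.isEmpty()) continue; // skip empty tokens left by split()
            counts.merge(token, 1, Integer::sum);
        }

        Map<String, Integer> merged = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            String word = entry.getKey();
            String lower = word.toLowerCase();
            boolean capitalized = Character.isUpperCase(word.charAt(0));
            // Capitalized words with no lowercase twin keep their own entry.
            String key = (capitalized && counts.containsKey(lower)) ? lower : word;
            merged.merge(key, entry.getValue(), Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical sample tokens, not taken from the pastebin text.
        List<String> tokens = Arrays.asList("Dog", "dog", "dog", "Four", "Alice");
        System.out.println(mergeCapitalized(tokens));
    }
}
```

With this sample, "Dog" is absorbed into "dog", while "Four" and "Alice" both remain as separate capitalized entries, which is exactly the ambiguity described above.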

Edit: I've solved the problem myself.

Thanks to Wes for trying to answer, though.

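1 answer:

Answer 0: (score: 1)

It seems your algorithm assumes that any word that appears capitalized but never appears uncapitalized is a proper noun. If that is the case, you can use the following algorithm to get the proper nouns.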

import java.util.HashMap;
import java.util.HashSet;

//Assume you have tokenized your whole file into a Collection<String> called allWords.
HashSet<String> lowercaseWords = new HashSet<>();        //words seen starting with a lowercase letter
HashMap<String,String> lowerToCap = new HashMap<>();     //lowercase form -> capitalized form as seen
for(String word: allWords) {
    if (word.isEmpty()) continue;                        //guard against empty tokens
    if (Character.isUpperCase(word.charAt(0))){
        lowerToCap.put(word.toLowerCase(),word);
    }
    else {
        lowercaseWords.add(word.toLowerCase());
    }
}

//drop every capitalized word that was also seen in lowercase; only proper nouns will be left
HashSet<String> properNounKeys = new HashSet<>(lowerToCap.keySet());
properNounKeys.removeAll(lowercaseWords);
for(String properNounLower: properNounKeys) {
    System.out.println("Proper Noun: "+ lowerToCap.get(properNounLower));
}
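
For reference, the same heuristic can be packaged as a self-contained class that you can run directly. This is only a sketch under the assumption stated in this answer (a word counts as a proper noun if it appears capitalized and never in lowercase). The names `ProperNounSketch` and `findProperNouns` and the sample tokens are illustrative, not taken from the question.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ProperNounSketch {
    // Returns the capitalized words that never occur in lowercase form, sorted.
    static Set<String> findProperNouns(Collection<String> allWords) {
        Set<String> lowercaseWords = new HashSet<>();
        Map<String, String> lowerToCap = new HashMap<>();
        for (String word : allWords) {
            if (word.isEmpty()) continue; // skip empty tokens
            if (Character.isUpperCase(word.charAt(0))) {
                lowerToCap.put(word.toLowerCase(), word);
            } else {
                lowercaseWords.add(word.toLowerCase());
            }
        }

        Set<String> properNouns = new TreeSet<>();
        for (Map.Entry<String, String> entry : lowerToCap.entrySet()) {
            // Keep only capitalized words with no lowercase occurrence.
            if (!lowercaseWords.contains(entry.getKey())) {
                properNouns.add(entry.getValue());
            }
        }
        return properNouns;
    }

    public static void main(String[] args) {
        // Hypothetical sample tokens, not taken from the pastebin text.
        List<String> tokens = Arrays.asList(
            "Alice", "saw", "the", "dog", "The", "dog", "saw", "Bob", "Now", "Bob", "left");
        System.out.println(findProperNouns(tokens)); // prints [Alice, Bob, Now]
    }
}
```

Note that "Now" is still reported in this sample because it never occurs in lowercase. The heuristic cannot tell sentence-initial words from proper nouns unless the lowercase form shows up somewhere in the text.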