
Time: 2010-09-04 03:11:38

Tags: java algorithm data-structures text word-boundary



Vulcans are a humanoid species in the fictional "Star Trek" universe who evolved on the planet Vulcan and are noted for their attempt to live by reason and logic with no interference from emotion They were the first extraterrestrial species officially to make first contact with Humans and later became one of the founding members of the "United Federation of Planets"

3 answers:

Answer 0 (score: 1)



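  • I think using WordNet could work (I'm not sure where bigrams/trigrams would come in), but you should treat the WordNet lookup as one part of a hybrid system, not the be-all and end-all of discovering named entities.
  • Then, start by applying some simple, common-sense criteria: sequences of capitalized words; try to accommodate frequent lowercase function words such as 'of' within these; sequences consisting of a "known title" plus capitalized word(s).
  • Look for sequences of words that you statistically wouldn't expect to occur together by chance, as candidates for entities.
  • Can you build in dynamic web lookup? (Your system spots the capitalized sequence "IBM" and checks whether it finds, for example, a Wikipedia entry with the text pattern "IBM is ... [organization | company | ...]".)
  • See whether anything here, and in the "information extraction" literature generally, gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html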
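
The capitalization heuristic in the second bullet can be sketched roughly as follows. This is a minimal illustration and not part of the original answer: the class name `CapitalizedCandidates`, the method `findCandidates`, and the small function-word list (`of`, `the`, `and`) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapitalizedCandidates {
    // One capitalized word, optionally followed by more capitalized words.
    // Lowercase function words from a small, hypothetical list (of, the, and)
    // are allowed between capitalized words, but never at the end of a match.
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b[A-Z][A-Za-z]*(?:\\s+(?:(?:of|the|and)\\s+)*[A-Z][A-Za-z]*)*");

    // Returns every capitalized sequence found in the text, in order of appearance.
    static List<String> findCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            candidates.add(m.group());
        }
        return candidates;
    }

    public static void main(String[] args) {
        String text = "They met Humans and later founded the United Federation of Planets";
        System.out.println(findCandidates(text));
    }
}
```

For that sample sentence the candidates are "They", "Humans" and "United Federation of Planets". The sentence-initial "They" is a false positive, which is one reason to combine this with the other criteria in the list (statistics, dictionary or web lookup) instead of relying on capitalization alone.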


Answer 1 (score: 0)


    // Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
    String text = "Vulcans are a humanoid species in the fictional \"Star Trek\"" +
        " universe who evolved on the planet Vulcan and are noted for their " +
        "attempt to live by reason and logic with no interference from emotion" +
        " They were the first extraterrestrial species officially to make first" +
        " contact with Humans and later became one of the founding members of the" +
        " \"United Federation of Planets\"";
    String[] entities = new String[10];                 // An array to hold matched substrings
    Pattern pattern = Pattern.compile("[\"](.*?)[\"]"); // The regex pattern to use
    Matcher matcher = pattern.matcher(text);            // The matcher - our text - to run the regex on
    int startFrom   = text.indexOf('"');                // The index position of the first " character
    int endAt       = text.lastIndexOf('"');            // The index position of the last " character
    int count       = 0;                                // An index for the array of matches
    while (startFrom > -1 && startFrom <= endAt) {      // startFrom will be changed to the index position of the end of the last match
        if (!matcher.find(startFrom)) {                 // Run the regex find() method from startFrom; stop if no quoted pair remains
            break;
        }
        entities[count++] = matcher.group(1);           // Add the match to the array, without its " marks
        startFrom = matcher.end();                      // Update the startFrom index position to the end of the matched region
    }


    int startFrom = text.indexOf('"');                              // The index-position of the first " character
    int nextQuote = text.indexOf('"', startFrom+1);                 // The index-position of the next " character
    int count = 0;                                                  // An index for the array of matches
    while (startFrom > -1 && nextQuote > -1) {                      // Keep looping while there is another pair of " characters (indexOf returns -1 when none is found)
        entities[count++] = text.substring(startFrom+1, nextQuote); // Retrieve the substring and add it to the array
        startFrom = text.indexOf('"', nextQuote+1);                 // Find the next " character after nextQuote
        nextQuote = text.indexOf('"', startFrom+1);                 // Find the next " character after that
    }



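    int i = 0;
    while (i < count) {
        System.out.println(entities[i++]); // Assumed loop body (missing in the source): print each match
    }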


    static int countQuoteChars(String text) {
        int nextQuote = text.indexOf('"');              // Find the first " character
        int count = 0;                                  // A counter for " characters found
        while (nextQuote != -1) {                       // While there is another " character ahead
            count++;                                    // Increase the count by 1
            nextQuote = text.indexOf('"', nextQuote+1); // Find the next " character
        }
        return count;                                   // Return the result
    }

    static boolean quoteCharacterParity(int numQuotes) {
        if (numQuotes % 2 == 0) { // If the number of " characters modulo 2 is 0
            return true;          // Return true for even
        }
        return false;             // Otherwise return false
    }

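Note that if numQuotes happens to be 0, this method still returns true (0 modulo any non-zero number is 0, so (numQuotes % 2 == 0) is true). You don't want to go on parsing when there are no " characters at all, so you will want to check for that case somewhere.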
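
One way to put the two checks together is sketched below. It is only an illustration under stated assumptions: the class name `QuotedEntities` and the method `extractQuoted` are hypothetical, and a `List` replaces the fixed-size array so the number of matches is not capped at 10.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedEntities {
    // Counts the " characters in the text (same approach as countQuoteChars above).
    static int countQuoteChars(String text) {
        int count = 0;
        int nextQuote = text.indexOf('"');
        while (nextQuote != -1) {
            count++;
            nextQuote = text.indexOf('"', nextQuote + 1);
        }
        return count;
    }

    // Returns the substrings between each pair of " characters.
    // Returns an empty list when there are no quotes or an odd number of them.
    static List<String> extractQuoted(String text) {
        List<String> entities = new ArrayList<>();
        int numQuotes = countQuoteChars(text);
        if (numQuotes == 0 || numQuotes % 2 != 0) {
            return entities; // Nothing to parse, or the quotes are unbalanced
        }
        int startFrom = text.indexOf('"');
        while (startFrom > -1) {
            // A closing quote always exists here because the count is even
            int nextQuote = text.indexOf('"', startFrom + 1);
            entities.add(text.substring(startFrom + 1, nextQuote));
            startFrom = text.indexOf('"', nextQuote + 1);
        }
        return entities;
    }

    public static void main(String[] args) {
        String text = "the fictional \"Star Trek\" universe and the \"United Federation of Planets\"";
        System.out.println(extractQuoted(text));
    }
}
```

Returning an empty list for unbalanced input is a design choice made for this sketch; throwing an exception or extracting only the complete pairs would be equally reasonable, depending on how strict the caller needs to be.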


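Answer 2 (score: 0)

Someone else asked a similar question about how to find "interesting" words in a corpus of text. You should read the answers. In particular, Bolo's answer points to an interesting article that uses the density of a word's appearances to decide how important it is, based on the observation that when a text talks about something, it usually refers to that thing fairly often. The article is interesting because the technique needs no prior knowledge of the text being processed (for example, you don't need a dictionary for a specific vocabulary).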
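
The article's actual method is not reproduced here. As a much cruder stand-in for the same intuition (things a text is about get mentioned often), the sketch below just counts raw word frequencies. The class name `WordFrequency`, the method `countWords`, and the four-letter length cutoff are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class WordFrequency {
    // Counts how often each word of four or more letters occurs (case-insensitive).
    // The length cutoff is a hypothetical, very rough way to skip short function words.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.length() < 4) {
                continue; // Also skips the empty token produced by leading punctuation
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Vulcans met Humans. Humans met Vulcans. Humans won.";
        System.out.println(countWords(text));
    }
}
```

Raw counts favor long, common words over genuinely distinctive ones, so this is only a starting point; it does share the property mentioned above of needing no dictionary or prior knowledge of the text.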


