Question

I am using the Stanford NLP POS tagger to tag the nouns and adjectives in my program.

    interest_NN
    bui_NNS
    ground_VBP
    avail_NN
    respond_NN
    detail_NN
    like_IN
    quickli_NNS
    current_JJ

Now I need to select only the words tagged _NN, _NNS, or _JJ, and strip those tags from the words, for example:

    quickli
    current
    avail

I tried the following to remove the _NN tag from the words, but it only strips the first two tags and then throws an exception:

    while (tagread.hasNext()) {
        String s = tagread.next();

        int flag = 1;
        jTextArea2.append("\n" + s.toLowerCase());

        String ofInterest2 = s.substring(0, s.indexOf("_NN"));

        for (int i = 0; i < s.length(); i++) {
            if (s.equals(ofInterest2)) {
                flag = 0;
            }
        }
        if (flag != 0) {
            System.out.println(ofInterest2);
        }
    }

Exception:

 java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(Unknown Source)

What is wrong with my approach? Or how should I proceed from here?

Answer 1

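Don't use String methods to strip the tag text; use the NLP API to extract the part of speech and compare against that.

Generate a List of TaggedWord objects, then use the TaggedWord API to extract the part of speech directly: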

    // Call the API to parse your sentence.
    List<TaggedWord> words = tagger.tagSentence( ... );

    // For each word tagged in the sentence...
    for( TaggedWord word : words ) {
      String tag = word.tag();

      // Check the part-of-speech directly, without having to parse the string.
      if( "NN".equalsIgnoreCase( tag ) ) {
        System.out.printf( "%s is a noun\n", word.word() );
      }
    }

See also the Stanford NLP API documentation.

To check for nouns, you should avoid the following:

    if( "NN".equalsIgnoreCase( tag ) ) {
      System.out.printf( "%s is a noun\n", word.word() );
    }

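This is because a part of speech can be tagged in more than one way (for example, NN and NNS are both noun tags), so an exact comparison against "NN" misses plural nouns. You can use a regular expression or startsWith instead.

You should ask the authors of TaggedWord to provide isNoun, isVerb, isNounPlural, and other such methods. That said, yes, you can use a regular expression to match the string. I also use startsWith in my own code to check for nouns, because it is faster than a regular expression. For example: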

    if( tag != null && tag.toUpperCase().startsWith( "NN" ) ) {
      System.out.printf( "%s is a noun\n", word.word() );
    }

To be truly object-oriented, inject a subclass of TaggedWord for the tagger to use. The subclass would then expose an isNoun method.
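As a rough illustration of what such predicates might look like, here is a minimal sketch that does not depend on the Stanford library. `PosTags`, `isNoun`, and `isAdjective` are hypothetical names, not part of the Stanford API; they operate on the plain tag string that `TaggedWord.tag()` returns, and they assume Penn Treebank tags, where noun tags begin with NN and adjective tags with JJ.

```java
import java.util.Locale;

/** Hypothetical helper; not part of the Stanford NLP API. */
final class PosTags {
    private PosTags() {}

    /** True for tags that start with NN (NN, NNS, NNP, NNPS in the Penn Treebank set). */
    static boolean isNoun(String tag) {
        return tag != null && tag.toUpperCase(Locale.ROOT).startsWith("NN");
    }

    /** True for tags that start with JJ (JJ, JJR, JJS in the Penn Treebank set). */
    static boolean isAdjective(String tag) {
        return tag != null && tag.toUpperCase(Locale.ROOT).startsWith("JJ");
    }

    public static void main(String[] args) {
        // Tags taken from the sample output in the question.
        for (String tag : new String[] {"NN", "NNS", "JJ", "VBP", "IN"}) {
            System.out.printf("%s noun=%b adjective=%b%n", tag, isNoun(tag), isAdjective(tag));
        }
    }
}
```

A TaggedWord subclass could delegate to helpers like these, for example by returning `PosTags.isNoun(tag())` from its own isNoun method.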

Answer 2

indexOf returns -1 when the argument you pass is not found in the String. On this line:

    String ofInterest2 = s.substring(0, s.indexOf("_NN"));

s.indexOf may not find "_NN" in the string s. You are then asking for the substring from 0 to -1, which makes no sense, so you get an exception.
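The failure and one way to guard against it can be sketched as follows. `SafeStrip` and `textBefore` are hypothetical names chosen for this example, and returning null for "not found" is just one possible convention.

```java
final class SafeStrip {
    private SafeStrip() {}

    /** Returns the text before marker, or null when marker does not occur in s. */
    static String textBefore(String s, String marker) {
        int idx = s.indexOf(marker);
        // indexOf returns -1 when marker is absent; s.substring(0, -1) would throw
        // StringIndexOutOfBoundsException, so only call substring for idx >= 0.
        return idx >= 0 ? s.substring(0, idx) : null;
    }

    public static void main(String[] args) {
        System.out.println(textBefore("interest_NN", "_NN")); // prints "interest"
        System.out.println(textBefore("ground_VBP", "_NN"));  // prints "null"
    }
}
```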

Answer 3

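You are trying to take a substring of the whole text "ground_VBP", but you pass it the result of s.indexOf("_NN"). That substring is not found, so indexOf returns -1. But -1 is not a valid index for substring, so substring throws the StringIndexOutOfBoundsException you reported.

You should only call substring when indexOf returns 0 or greater (that is, when the text was found).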
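Putting that check together with the filtering the question asks for, a string-based version might look like the sketch below. It assumes each token has the form word_TAG and splits on the last underscore rather than searching for "_NN", so tokens with other tags are skipped instead of causing an exception. `TagFilter` and `keepNounsAndAdjectives` are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TagFilter {
    // The tags the question wants to keep.
    private static final Set<String> WANTED = new HashSet<>(Arrays.asList("NN", "NNS", "JJ"));

    private TagFilter() {}

    /** Returns the bare words of all tokens whose tag is NN, NNS, or JJ. */
    static List<String> keepNounsAndAdjectives(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String t = token.trim();
            int sep = t.lastIndexOf('_');
            if (sep <= 0) {
                continue; // no underscore (sep == -1) or empty word (sep == 0)
            }
            String word = t.substring(0, sep);
            String tag = t.substring(sep + 1);
            if (WANTED.contains(tag)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tagged = Arrays.asList(
            "interest_NN", "bui_NNS", "ground_VBP", "avail_NN", "respond_NN",
            "detail_NN", "like_IN", "quickli_NNS", "current_JJ");
        // ground_VBP and like_IN are dropped; the rest lose their tags.
        System.out.println(keepNounsAndAdjectives(tagged));
    }
}
```

Comparing the tag exactly against a set keeps _NN and _NNS distinct from other tags that merely start with the same letters; swap in a startsWith check if proper nouns such as NNP should also be kept.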

Removing tags from words after using the POS tagger in Java
