Error when extracting keywords with Lucene

Time: 2014-06-25 08:42:51

Tags: java lucene tokenize feature-extraction

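I am completely new to the concept of text extraction. While looking for an example, I found one implemented with Lucene. I simply tried to run it in Eclipse, but it throws an error. This is the error I get: (TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow). I took the code directly from an article published online and made a few changes, because I first want to make sure the code runs without errors and then understand its parts one by one. The original code fetched its text from a URL, but I changed it to take the text from a string defined in the main method. I also changed the version, because I am using Lucene 4.8.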

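I also searched for the error and made some changes, but I still get it. My code is below. Could you help me get rid of this error? Where should I modify the code to avoid it? This is the link where I got the code: http://pastebin.com/jNALz7DZ. Below is the code as I modified it.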

public class KeywordsGuesser {

     /** Lucene version. */
     private static Version LUCENE_VERSION = Version.LUCENE_48;

     /**
      * Keyword holder, composed by a unique stem, its frequency, and a set of found corresponding
      * terms for this stem.
      */
    public static class Keyword implements Comparable<Keyword> {

         /** The unique stem. */
         private String stem;

         /** The frequency of the stem. */
         private Integer frequency;

         /** The found corresponding terms for this stem. */
        private Set<String> terms;

         /**
          * Unique constructor.
          * 
          * @param stem The unique stem this instance must hold.
          */
         public Keyword(String stem) {
             this.stem = stem;
            terms = new HashSet<String>();
             frequency = 0;
         }

         /**
          * Add a found corresponding term for this stem. If this term has been already found, it
          * won't be duplicated but the stem frequency will still be incremented.
          * 
          * @param term The term to add.
          */
         private void add(String term) {
             terms.add(term);
             frequency++;
         }

         /**
          * Gets the unique stem of this instance.
          * 
          * @return The unique stem.
          */
         public String getStem() {
             return stem;
         }

         /**
          * Gets the frequency of this stem.
          * 
          * @return The frequency.
          */
         public Integer getFrequency() {
             return frequency;
         }

         /**
          * Gets the list of found corresponding terms for this stem.
          * 
          * @return The list of found corresponding terms.
          */
        public Set<String> getTerms() {
             return terms;
         }

         /**
          * Used to reverse sort a list of keywords based on their frequency (from the most frequent
          * keyword to the least frequent one).
          */
         @Override
         public int compareTo(Keyword o) {
             return o.frequency.compareTo(frequency);
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public boolean equals(Object obj) {
             return obj instanceof Keyword && obj.hashCode() == hashCode();
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public int hashCode() {
             return Arrays.hashCode(new Object[] { stem });
         }

         /**
          * User-readable representation of a keyword: "[stem] x[frequency]".
          */
         @Override
         public String toString() {
             return stem + " x" + frequency;
         }

     }

     /**
      * Stemmize the given term.
      * 
      * @param term The term to stem.
      * @return The stem of the given term.
      * @throws IOException If an I/O error occured.
      */
     private static String stemmize(String term) throws IOException {

         // tokenize term
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(term));
         // stemmize
         tokenStream = new PorterStemFilter(tokenStream);

         Set<String> stems = new HashSet<String>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
         // for each token
         while (tokenStream.incrementToken()) {
             // add it in the dedicated set (to keep unicity)
             stems.add(token.toString());
         }

         // if no stem or 2+ stems have been found, return null
         if (stems.size() != 1) {
             return null;
         }

         String stem = stems.iterator().next();

         // if the stem has non-alphanumerical chars, return null
         if (!stem.matches("[\\w-]+")) {
             return null;
         }

         return stem;
     }

     /**
      * Tries to find the given example within the given collection. If it hasn't been found, the
      * example is automatically added in the collection and is then returned.
      * 
      * @param collection The collection to search into.
      * @param example The example to search.
      * @return The existing element if it has been found, the given example otherwise.
      */
     private static <T> T find(Collection<T> collection, T example) {
         for (T element : collection) {
             if (element.equals(example)) {
                 return element;
             }
         }
         collection.add(example);
         return example;
     }

     /**
      * Extracts text content from the given URL and guesses keywords within it (needs jsoup parser).
      * 
      * @param The URL to read.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occurred.
      * @see <a href="http://jsoup.org/">http://jsoup.org/</a>
      */
     public static List<Keyword> guessFromUrl(String url) throws IOException {
         // get textual content from url
         //Document doc = Jsoup.connect(url).get();
         //String content = doc.body().text();

       String content = url;
         // guess keywords from this content
         return guessFromString(content);
     }

     /**
      * Guesses keywords from given input string.
      * 
      * @param input The input string.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occured.
      */
     public static List<Keyword> guessFromString(String input) throws IOException {

         // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
         input = input.replaceAll("-+", "-0");
         // replace any punctuation char but dashes and apostrophes and by a space
         input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
         // replace most common English contractions
         input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

         // tokenize input
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(input));
         // to lower case
         tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
         // remove dots from acronyms (and "'s" but already done manually above)
         tokenStream = new ClassicFilter(tokenStream);
         // convert any char to ASCII
         tokenStream = new ASCIIFoldingFilter(tokenStream);
         // remove english stop words
         tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());

         List<Keyword> keywords = new LinkedList<Keyword>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

         // for each token
         while (tokenStream.incrementToken()) {
             String term = token.toString();
             // stemmize
             String stem = stemmize(term);
             if (stem != null) {
                 // create the keyword or get the existing one if any
                 Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                 // add its corresponding initial token
                 keyword.add(term.replaceAll("-0", "-"));
             }
         }



         tokenStream.end();
         tokenStream.close();


         // reverse sort by frequency
         Collections.sort(keywords);

         return keywords;
     }



     public static void main(String args[]) throws IOException{

       String input = "Java is a computer programming language that is concurrent, "
               + "class-based, object-oriented, and specifically designed to have as few "
               + "implementation dependencies as possible. It is intended to let application developers "
               + "write once, run anywhere (WORA), "
               + "meaning that code that runs on one platform does not need to be recompiled "
               + "to run on another. Java applications are typically compiled to byte code (class file) "
               + "that can run on any Java virtual machine (JVM) regardless of computer architecture. "
               + "Java is, as of 2014, one of the most popular programming languages in use, particularly "
               + "for client-server web applications, with a reported 9 million developers."
               + "[10][11] Java was originally developed by James Gosling at Sun Microsystems "
               + "(which has since merged into Oracle Corporation) and released in 1995 as a core "
               + "component of Sun Microsystems' Java platform. The language derives much of its syntax "
               + "from C and C++, but it has fewer low-level facilities than either of them."
               + "The original and reference implementation Java compilers, virtual machines, and "
               + "class libraries were developed by Sun from 1991 and first released in 1995. As of "
               + "May 2007, in compliance with the specifications of the Java Community Process, "
               + "Sun relicensed most of its Java technologies under the GNU General Public License. "
               + "Others have also developed alternative implementations of these Sun technologies, "
               + "such as the GNU Compiler for Java (byte code compiler), GNU Classpath "
               + "(standard libraries), and IcedTea-Web (browser plugin for applets).";

       System.out.println(KeywordsGuesser.guessFromString(input));
     }



 }

This is the error Eclipse outputs:

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
    at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.zzRefill(ClassicTokenizerImpl.java:431)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.getNextToken(ClassicTokenizerImpl.java:638)
    at org.apache.lucene.analysis.standard.ClassicTokenizer.incrementToken(ClassicTokenizer.java:140)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.standard.ClassicFilter.incrementToken(ClassicFilter.java:47)
    at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:104)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
    at beehex.lucene.KeywordsGuesser.guessFromString(KeywordsGuesser.java:239)
    at beehex.lucene.KeywordsGuesser.main(KeywordsGuesser.java:288)

After getting rid of the error, my output is:

  

[java x10,开发x5,sun x5,运行x4,compil x4,languag x3,   实现x3,应用x3,代码x3,gnu x3,计算机x2,程序x2,   指定x2,x2,x2,平台x2,字节x2,类x2,虚x2,   machin x2,大多数x2,原点x2,microsystem x2,ha x2,releas x2,1995   x2,它x2,来自x2,c x2,librari x2,technolog x2,concurr x1,   class-bas x1,object-ori x1,设计x1,少数x1,依赖x1,possibl x1,   打算x1,让x1,写x1,onc x1,任何x1,wora x1,意思是x1,doe   x1,需要x1,重新编译x1,anoth x1,典型x1,文件x1,可以x1,ani x1,   jvm x1,无论x1,架构x1,2014 x1,流行x1,us x1,   specialli x1,client-serv x1,web x1,报告x1,9 x1,百万x1,   10 x1,11 x1,jame x1,gosl x1,x1,sinc x1,merg x1,oracl x1,   corpor x1,core x1,compon x1,deriv x1,much x1,syntax x1,less x1,   低级x1,facil x1,而不是x1,x1,x1,x1,是x1   x1,1991 x1,第一个x1,mai x1,2007 x1,complianc x1,commun x1,   进程x1,relicens x1,x1下,gener x1,public x1,licens x1,   其他x1,x1,交替x1,类路径x1,标准x1,icedtea-web   x1,浏览器x1,插件x1,小程序x1]

1 Answer:

Answer 0 (score: 2)

You need to reset the TokenStream object before you call its incrementToken method, as the error states:

// add this line
tokenStream.reset();
while (tokenStream.incrementToken()) {
....
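
For reference, the full consuming workflow that the exception message points to is roughly reset(), then a loop over incrementToken(), then end() and close(). The sketch below is a minimal illustration, not the original article's code: it assumes lucene-core and lucene-analyzers-common 4.8 are on the classpath, and the class name TokenStreamWorkflow and the helper tokenize are hypothetical names introduced only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class, written against the Lucene 4.8 API used in the question.
public class TokenStreamWorkflow {

    /** Tokenizes and lower-cases the text, following the TokenStream contract. */
    static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<String>();

        // Build the analysis chain: tokenizer first, then filters wrapped around it.
        TokenStream stream = new ClassicTokenizer(Version.LUCENE_48, new StringReader(text));
        stream = new LowerCaseFilter(Version.LUCENE_48, stream);

        // Obtain the attribute before consuming the stream.
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        try {
            stream.reset();                    // must be called once before incrementToken()
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();                      // signals that the end of the stream was reached
        } finally {
            stream.close();                    // releases the underlying reader
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("Hello Lucene world"));
    }
}
```

Note that the stemmize method in the question also consumes a TokenStream without calling reset(), so the same change (plus end() and close()) would likely be needed there as well.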