How can I extract keyphrases from a given text with OpenNLP?

Asked: 2015-09-09 17:33:50

Tags: java lucene nlp keyword opennlp

I am using Apache OpenNLP and I want to extract the keyphrases of a given text. I have already gathered the entities - but what I would really like to have are keyphrases.

The problem I have is that I can't use TF-IDF, because I don't have a model for it and I only have a single text (not multiple documents).
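To make that concrete: in its usual form TF-IDF weights a term t in a document d as

    \text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}

and with a "corpus" of a single document N = 1 and df(t) = 1 for every term that occurs at all, so the IDF factor is log(1) = 0 and the weighting tells me nothing.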

Here is some code (a prototype - not that clean):

    public List<KeywordsModel> extractKeywords(String text, NLPProvider pipeline) {

        // OpenNLP components, built from the pre-trained English models
        SentenceDetectorME sentenceDetector = new SentenceDetectorME(pipeline.getSentencedetecto("en"));
        TokenizerME tokenizer = new TokenizerME(pipeline.getTokenizer("en"));
        POSTaggerME posTagger = new POSTaggerME(pipeline.getPosmodel("en"));
        ChunkerME chunker = new ChunkerME(pipeline.getChunker("en"));

        ArrayList<String> stopwords = pipeline.getStopwords("en");

        Span[] sentSpans = sentenceDetector.sentPosDetect(text);
        Map<String, Float> results = new LinkedHashMap<>();
        SortedMap<String, Float> sortedData = new TreeMap<>(new MapSort.FloatValueComparer(results));

        float sentenceCounter = sentSpans.length;
        float prominenceVal = 0;
        int sentences = sentSpans.length;
        for (Span sentSpan : sentSpans) {
            // prominence decreases linearly from 1.0 (first sentence) to 1/sentences (last sentence)
            prominenceVal = sentenceCounter / sentences;
            sentenceCounter--;
            String sentence = sentSpan.getCoveredText(text).toString();
            int start = sentSpan.getStart();
            Span[] tokSpans = tokenizer.tokenizePos(sentence);
            String[] tokens = new String[tokSpans.length];
            for (int i = 0; i < tokens.length; i++) {
                tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
            }
            String[] tags = posTagger.tag(tokens);
            Span[] chunks = chunker.chunkAsSpans(tokens, tags);
            for (Span chunk : chunks) {
                if ("NP".equals(chunk.getType())) {
                    // map the noun-phrase chunk back to character offsets in the original text
                    int npstart = start + tokSpans[chunk.getStart()].getStart();
                    int npend = start + tokSpans[chunk.getEnd() - 1].getEnd();
                    String potentialKey = text.substring(npstart, npend);
                    if (!results.containsKey(potentialKey)) {
                        boolean hasStopWord = false;
                        String[] pKeys = potentialKey.split("\\s+");
                        if (pKeys.length < 3) {
                            // reject candidates containing a stopword (stopwords are applied as regex patterns)
                            for (String pKey : pKeys) {
                                for (String stopword : stopwords) {
                                    if (pKey.toLowerCase().matches(stopword)) {
                                        hasStopWord = true;
                                        break;
                                    }
                                }
                                if (hasStopWord) {
                                    break;
                                }
                            }
                        } else {
                            // candidates of three or more words are discarded
                            hasStopWord = true;
                        }
                        if (!hasStopWord) {
                            // score = log-damped occurrence count + prominence of the sentence it first appears in
                            int count = StringUtils.countMatches(text, potentialKey);
                            results.put(potentialKey, (float) (Math.log(count) / 100) + (float) (prominenceVal / 5));
                        }
                    }
                }
            }
        }
        sortedData.putAll(results);
        System.out.println(sortedData);
        return null; // prototype: the results are only printed for now
    }

What it basically does is give me back the noun phrases, sorted by a prominence value (where in the text do they appear?) combined with their occurrence counts.
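For reference, the score the loop above assigns to a candidate phrase p is

    \text{score}(p) = \frac{\ln(\text{count}(p))}{100} + \frac{\text{prominence}(p)}{5},
    \qquad \text{prominence}(p) = \frac{\text{sentences remaining}}{\text{total sentences}}

so for example a phrase that occurs 3 times and is first chunked in the first of 10 sentences gets ln(3)/100 + 1.0/5 ≈ 0.21, while a phrase occurring once in the last sentence gets 0 + 0.1/5 = 0.02. The position term clearly dominates the count term.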

But honestly - it doesn't work that well.

I also tried using the Lucene analyzers, but the results weren't great either.
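By "Lucene analyzers" I mean roughly this kind of thing: running the text through an analyzer chain with a ShingleFilter and counting the resulting 2-3 word shingles as phrase candidates. A minimal sketch of that idea (illustrative only, not my exact code; the method and field names are made up):

    // Needs lucene-core and lucene-analyzers-common on the classpath:
    // org.apache.lucene.analysis.Analyzer / TokenStream,
    // org.apache.lucene.analysis.standard.StandardAnalyzer,
    // org.apache.lucene.analysis.shingle.ShingleFilter,
    // org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    public Map<String, Integer> shingleCandidates(String text) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (Analyzer analyzer = new StandardAnalyzer();
             ShingleFilter shingles = new ShingleFilter(
                     analyzer.tokenStream("content", new StringReader(text)), 2, 3)) {
            shingles.setOutputUnigrams(false); // only emit 2-3 word combinations
            CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
            shingles.reset();
            while (shingles.incrementToken()) {
                // shingles are space-joined token n-grams, e.g. "solar charger"
                counts.merge(term.toString(), 1, Integer::sum);
            }
            shingles.end();
        }
        return counts;
    }

ShingleFilter just concatenates adjacent tokens, so with nothing but frequency to rank by this mostly surfaces common collocations rather than real keyphrases.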

So - how can I achieve what I'm trying to do? I already know about KEA / Maui-indexer etc. (but I'm afraid I can't use them because of the GPL :( )

Also interesting: which other algorithms could I use instead of TF-IDF?

Example:

This text: http://techcrunch.com/2015/09/04/etsys-pulling-the-plug-on-grand-st-at-the-end-of-this-month/

Output that I would consider good: Etsy, Grand St., solar chargers, maker marketplace, tech hardware

0 Answers:

There are no answers yet.