Question

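I am running some tests with WordDelimiterFilter in Solr, but it does not preserve the protected word list that I pass to it. Could you please look at the code and the sample output and suggest which part is missing or used incorrectly?

I am running this code: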

private static Analyzer getWordDelimiterAnalyzer() {
    return new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(Version.LUCENE_32, reader);
            WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
            HashMap<String, String> args = new HashMap<String, String>();
            args.put("generateWordParts", "1");
            args.put("generateNumberParts", "1");
            args.put("catenateWords", "1");
            args.put("catenateNumbers", "1");
            args.put("catenateAll", "0");
            args.put("luceneMatchVersion", Version.LUCENE_32.name());
            args.put("language", "English");
            args.put("protected", "protected.txt");
            wordDelimiterFilterFactory.init(args);
            ResourceLoader loader = new SolrResourceLoader(null, null);
            wordDelimiterFilterFactory.inform(loader);
            /*List<String> protectedWords = new ArrayList<String>();
            protectedWords.add("good bye");
            protectedWords.add("hello world");
            wordDelimiterFilterFactory.inform(new LinesMockSolrResourceLoader(protectedWords));
            */
            return wordDelimiterFilterFactory.create(stream);
        }
    };
}

Input text:

hello world
good bye
what is your plan for future?

Protected strings:

good bye
hello world

Output:

(hello, startOffset=0, endOffset=5, positionIncrement=1, type=)
(world, startOffset=6, endOffset=11, positionIncrement=1, type=)
(good, startOffset=12, endOffset=16, positionIncrement=1, type=)
(bye, startOffset=17, endOffset=20, positionIncrement=1, type=)
(what, startOffset=21, endOffset=25, positionIncrement=1, type=)
(is, startOffset=26, endOffset=28, positionIncrement=1, type=)
(your, startOffset=29, endOffset=33, positionIncrement=1, type=)
(plan, startOffset=34, endOffset=38, positionIncrement=1, type=)
(for, startOffset=39, endOffset=42, positionIncrement=1, type=)
(future, startOffset=43, endOffset=49, positionIncrement=1, type=)

Answer 1

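You are using a standard tokenizer, which tokenizes at least on whitespace, so "hello world" will always be split into "hello" and "world" before the WordDelimiterFilter ever sees it.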

TokenStream stream = new StandardTokenizer(Version.LUCENE_32, reader);

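See the Lucene documentation:

public final class StandardTokenizer extends Tokenizer

A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

Splits words at punctuation characters, removing punctuation. However, a dot that is not followed by whitespace is considered part of a token.

Splits words at hyphens, unless there is a number in the token, in which case the whole token is interpreted as a product number and is not split.

Recognizes email addresses and internet hostnames as one token.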

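The word delimiter's protected word list is meant for cases like the following:

ISBN2345677 will be split into ISBN 2345677
text2html will not be split into text 2 html (because text2html was added to the protected words)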
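
To make the order of operations concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It only simulates the idea: a tokenizer splits on whitespace first, then a filter splits each single token at letter/digit boundaries unless that token is in the protected set. The names ProtectedWordsSketch, splitOnTypeChange and analyze are made up for illustration, and the real WordDelimiterFilter handles many more cases than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProtectedWordsSketch {

    // Hypothetical stand-in for the filter: split one token wherever a
    // letter is followed by a digit or a digit is followed by a letter.
    static List<String> splitOnTypeChange(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            boolean prevDigit = Character.isDigit(token.charAt(i - 1));
            boolean currDigit = Character.isDigit(token.charAt(i));
            if (prevDigit != currDigit) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }

    // Tokenize on whitespace first, then apply the filter token by token.
    // The protected set is consulted per token, so an entry that contains
    // a space can never match anything.
    static List<String> analyze(String text, Set<String> protectedWords) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (protectedWords.contains(token)) {
                out.add(token); // protected: pass through unchanged
            } else {
                out.addAll(splitOnTypeChange(token));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> protectedWords =
                new HashSet<>(Arrays.asList("text2html", "hello world"));
        // prints [ISBN, 2345677, text2html, hello, world]
        System.out.println(
                analyze("ISBN2345677 text2html hello world", protectedWords));
    }
}
```

The "hello world" entry has no effect here because the filter only ever sees "hello" and "world" separately, which mirrors what happens in the question.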

If you really want to do something like what you describe, you could use the KeywordTokenizer. But then you have to do all of the splitting yourself.
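
As a rough idea of what "doing the splitting yourself" could look like, the sketch below takes the whole input as one string (which is roughly what a keyword-style tokenizer would hand you) and splits it on whitespace while keeping protected phrases together. This is plain Java, not Lucene API; the class PhraseAwareSplitter and its split method are hypothetical, and in a real analyzer this logic would presumably have to live in a custom filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhraseAwareSplitter {

    // Split on whitespace, but emit any protected phrase as a single token.
    // When several phrases match at the same position, the longest wins.
    static List<String> split(String text, List<String> protectedPhrases) {
        String[] words = text.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int matched = 0;
            String best = null;
            for (String phrase : protectedPhrases) {
                String[] p = phrase.split("\\s+");
                if (p.length > matched
                        && i + p.length <= words.length
                        && Arrays.equals(p, Arrays.copyOfRange(words, i, i + p.length))) {
                    matched = p.length;
                    best = phrase;
                }
            }
            if (best != null) {
                tokens.add(best); // phrase is emitted as written in the list
                i += matched;
            } else {
                tokens.add(words[i]);
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> protectedPhrases = Arrays.asList("good bye", "hello world");
        // prints [hello world, good bye, what, is, your, plan, for, future?]
        System.out.println(split(
                "hello world good bye what is your plan for future?",
                protectedPhrases));
    }
}
```

Note that this sketch does not strip punctuation or record offsets, so "future?" keeps its question mark; a real implementation would need to handle both.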
