Question

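I have a list of words in a file. They may contain contractions such as who's, didn't and so on. When reading them in, I need to expand them into their proper forms, such as "who is" and "did not". This has to be done in Java, and I need to do it without losing much time.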

This is actually for handling such queries during a search that uses Solr.

Below is sample code I tried, using a hash map:

import java.util.HashMap;
import java.util.Map;

// Contraction suffix -> expanded form
Map<String, String> con = new HashMap<String, String>();
con.put("'s", " is");
con.put("'d", " would");
con.put("'re", " are");
con.put("'ll", " will");
con.put("n't", " not");
con.put("'nt", " not");

String str = "where'd you're you'll would'nt hello";

String[] words = str.split(" ");
for (int i = 0; i < words.length; i++) {
    int index = words[i].lastIndexOf('\'');
    if (index < 0) {
        continue; // no apostrophe: skip this word instead of ending the loop
    }
    // "n't" starts one character before the apostrophe
    if (words[i].endsWith("n't")) {
        index--;
    }
    String suffix = words[i].substring(index);
    if (con.containsKey(suffix)) {
        words[i] = words[i].substring(0, index) + con.get(suffix);
    }
    System.out.println(words[i]);
}

Answer 1

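If you are worried about a query containing e.g. "who's" failing to find documents containing e.g. "who is", then you should look at using a stemmer, which is designed for exactly this purpose.

You can easily add a stemmer by configuring it as a filter in your Solr config. See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters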

EDIT:
A SnowballPorterFilterFactory will probably do the job for you.
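
For illustration, here is a rough sketch of how such a filter might be wired into a Solr schema. The field type name text_en_stem, the choice of tokenizer and the lowercase filter are assumptions made for the example, not something this answer specifies. It is worth running a few contractions through the Analysis screen of the Solr admin UI to see what tokens the chain actually produces for them.

```xml
<!-- Hypothetical field type; the name and the filter order are illustrative only -->
<fieldType name="text_en_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The stemmer suggested above -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Because no separate index and query analyzers are declared, the same chain applies to both indexed documents and incoming queries, which is what lets the two sides meet on the same stemmed tokens.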

Answer 2

Following up on the last sentence of @James Jithin's answer:

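The "'s" -> " is" transformation is incorrect if the word is a possessive.
The "'d" -> " would" transformation is incorrect in archaic forms, where "'d" can be a contraction of "ed".
The "'nt" -> " not" transformation is incorrect because it is really just a misspelling of the "n't" contraction. (I mean, "wo'nt" is just plain wrong ... isn't it?)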

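So in my opinion, the best way to implement this is to enumerate a small number of common, valid contractions and leave everything else as it is. This has the added benefit that you can implement it with simple string matching rather than suffix matching.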
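
A minimal sketch of that idea, assuming whole words separated by single spaces. The class name ContractionExpander and the particular entries in the map are made up for the example; a real list would be chosen to fit your data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ContractionExpander {
    // Hypothetical, deliberately short list: whole words only, no suffix rules.
    private static final Map<String, String> CONTRACTIONS = new HashMap<>();
    static {
        CONTRACTIONS.put("don't", "do not");
        CONTRACTIONS.put("didn't", "did not");
        CONTRACTIONS.put("won't", "will not");
        CONTRACTIONS.put("can't", "cannot");
        CONTRACTIONS.put("you're", "you are");
        CONTRACTIONS.put("you'll", "you will");
        CONTRACTIONS.put("they're", "they are");
    }

    // Replaces each word found in the map; every other word is left untouched.
    static String expand(String text) {
        String[] words = text.split(" ");
        for (int i = 0; i < words.length; i++) {
            // Lookup is case-insensitive; the replacement itself is lower case.
            String replacement = CONTRACTIONS.get(words[i].toLowerCase(Locale.ROOT));
            if (replacement != null) {
                words[i] = replacement;
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Prints: you are sure the script's author will not mind
        System.out.println(expand("you're sure the script's author won't mind"));
    }
}
```

Each word costs one hash lookup, and possessives such as script's pass through unchanged because they are simply not in the list.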

Answer 3

The code can be written as:

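import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Contraction suffix -> expanded form
Map<String, String> con = new HashMap<String, String>();
con.put("'s", " is");
con.put("'d", " would");
con.put("'re", " are");
con.put("'ll", " will");
con.put("n't", " not");
con.put("'nt", " not");

String str = "where'd you're you'll would'nt hello";

// Replace each suffix wherever it ends a word (\b is a regex word boundary)
for (Map.Entry<String, String> entry : con.entrySet()) {
    str = str.replaceAll(Pattern.quote(entry.getKey()) + "\\b", entry.getValue());
}
System.out.println(str); // where would you are you will would not hello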

This keeps your logic. But suppose the word is script's, a word that shows possession: changing it to script is changes the meaning.
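
One way to sidestep that, sketched here under my own assumptions rather than as part of the code above, is to compile a single pattern once and leave the ambiguous "'s" and "'d" suffixes out of it. The class name SuffixExpander and the choice of which suffixes to keep are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SuffixExpander {
    // Only suffixes with a single reading; "'s" and "'d" are deliberately absent.
    private static final Map<String, String> SUFFIXES = new HashMap<>();
    static {
        SUFFIXES.put("'re", " are");
        SUFFIXES.put("'ll", " will");
        SUFFIXES.put("n't", " not");
    }

    // Compiled once and reused: a listed suffix that follows a word character
    // and ends at a word boundary.
    private static final Pattern SUFFIX = Pattern.compile("(?<=\\w)('re|'ll|n't)\\b");

    // Expands every matching suffix in one pass over the text.
    static String expand(String text) {
        Matcher m = SUFFIX.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(SUFFIXES.get(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: you are sure the script's owner did not know they will call
        System.out.println(expand("you're sure the script's owner didn't know they'll call"));
    }
}
```

The text is scanned once instead of once per map key. Suffix rules still mishandle irregular forms: this sketch turns won't into "wo not" and can't into "ca not", so those would need special handling.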

What is an efficient way to handle word contractions using Java?
