Question

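I have a text containing words that I want to tag, and the words to tag are held in a List. The problem is that some of these words are substrings of other words, and I want to tag the longest string from the list that is recognized.

For example, if my text is "foo and bar are different from foo bar." and my list contains "foo", "bar" and "foo bar", the result should be "[tag]foo[/tag] and [tag]bar[/tag] are different from [tag]foo bar[/tag]."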

What should the code of someFunction be so that, given the setup below, the String tagged holds the tagged text shown above?

String text = "foo and bar are different from foo bar.";
List<String> words = new ArrayList<>();
words.add("foo");
words.add("bar");
words.add("foo bar");
String tagged = someFunction(text, words);

Answer 1

Use String's split method and compare each token against the List.

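// Tags each space-separated token that appears in the list.
// Limitation: a multi-word entry such as "foo bar" can never equal a single token,
// and a token with attached punctuation (e.g. "bar.") will not match "bar".
String someFunction(String text, List<String> words) {
  StringBuilder res = new StringBuilder();
  for (String st : text.split(" ")) {
    if (res.length() > 0) {
      res.append(' ');
    }
    if (words.contains(st)) {
      res.append("<tag>").append(st).append("</tag>");
    } else {
      res.append(st); // keep untagged tokens so the rest of the text survives
    }
  }
  return res.toString();
}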

Answer 2

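You want a regular expression that contains every possible word and greedily matches one or more of them. You can then walk the regex's match results to get each match, and because the matching is greedy, each match will be of maximal length. The exact regex depends on your words, on how much whitespace matters to you, and on whether foobar should count as a match for "foo" and "bar".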
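A minimal sketch of this idea, under a few assumptions that are not in the answer: the class and method names (AlternationTagger, tagWords) are hypothetical, every listed word is assumed to start and end with a word character so that \b boundaries apply, and "foobar" is deliberately not treated as a match. Java tries the alternatives of a | group from left to right, so the sketch orders them longest first to make the longest listed word win.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationTagger {
    // Hypothetical helper: tags every listed word in a single regex pass.
    static String tagWords(String text, List<String> words) {
        if (words.isEmpty()) {
            return text; // an empty alternation would match the empty string
        }
        // Copy so the caller's list is not reordered, then sort longest first.
        List<String> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        // Quote each word so regex metacharacters are taken literally.
        String alternation = sorted.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));

        // \b keeps "foo" from matching inside "foobar" (an assumption of this sketch).
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");

        // $0 refers to the whole match.
        return pattern.matcher(text).replaceAll("<tag>$0</tag>");
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("foo", "bar", "foo bar");
        System.out.println(tagWords("foo and bar are different from foo bar.", words));
        // prints: <tag>foo</tag> and <tag>bar</tag> are different from <tag>foo bar</tag>.
    }
}
```

Because the text is scanned only once, an already tagged "foo bar" is never revisited, so no nested tags can appear.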

Answer 3

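Replace every matching word with a marker (in my example I use |i| as the marker, where i is the index of the word being tagged), then replace the markers with the tagged words. Try this method: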

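import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

private static String someFunction(String text, List<String> words) {
        //Container for the tagged strings
        List<String> tagged = new ArrayList<>();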

        //Sort the list by string length, longest first
        //Note: this reorders the caller's list in place
        Collections.sort(words, (s1, s2) -> Integer.compare(s2.length(), s1.length()));

        //Replace every occurrence of each listed word with a marker |0|, |1|, etc...
        //String.replace matches literally, so regex metacharacters in a word are safe
        for (int i = 0; i < words.size(); i++) {
            text = text.replace(words.get(i), "|" + i + "|");
            //Save the matching word and put it between tags
            tagged.add("<tag>" + words.get(i) + "</tag>");
        }

        //Replace all markers with the tagged words
        for (int i = 0; i < tagged.size(); i++) {
            text = text.replace("|" + i + "|", tagged.get(i));
        }


        return text;
    }

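Warning: I assume that my marker '|i|' never appears in the text. Replace it with any symbol you expect never to appear in the text. This is just an idea, not a perfect answer.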

Answer 4

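This smells like homework, but I'll give you some pointers.

If B is a substring of A and B is not equal to A, then B must be shorter than A. You also said it yourself:

[...] but I want to tag the longest recognized string from the list.

So we have to sort the word list by length, longest first. I'll leave it to you to figure out how to do that; Collections.sort(List<T>, Comparator<? super T>) is what you'll want to use.

The next problem is the actual replacement. If you simply loop over all the words and use String.replaceAll(String, String), your example ends up looking like this: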

<tag>foo</tag> and <tag>bar</tag> are different from <tag><tag>foo</tag> <tag>bar</tag></tag>.
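The nested result can be reproduced with a short sketch. The class and method names (NaiveTagger, tagNaively) are hypothetical, and the list is assumed to be sorted longest first already; Pattern.quote and Matcher.quoteReplacement are added only so that each word is taken literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveTagger {
    // Hypothetical helper: wraps every occurrence of every word, one word at a time.
    static String tagNaively(String text, List<String> sortedWords) {
        for (String word : sortedWords) {
            String replacement = Matcher.quoteReplacement("<tag>" + word + "</tag>");
            // Later iterations also see the text inserted by earlier ones,
            // so "foo" and "bar" are wrapped again inside the tagged "foo bar".
            text = text.replaceAll(Pattern.quote(word), replacement);
        }
        return text;
    }

    public static void main(String[] args) {
        List<String> sortedWords = Arrays.asList("foo bar", "foo", "bar");
        System.out.println(tagNaively("foo and bar are different from foo bar.", sortedWords));
        // prints: <tag>foo</tag> and <tag>bar</tag> are different from <tag><tag>foo</tag> <tag>bar</tag></tag>.
    }
}
```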

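That's because we first surround 'foo bar', and then we surround foo and bar again. Thankfully, the first argument of String.replaceAll(String, String) is a regular expression.

The trick is to match the word, but only if it isn't surrounded already. And not just fully surrounded: a leading or trailing tag counts too, because it might be the foo in the already tagged <tag>foo bar</tag>. Something like "(?<!(\\w|>))+" + word + "(?!(\\w|<))+" would only match if word has no leading >, no trailing <, and is not in the middle of another word. (I admit I'm not great with regular expressions, so I'm sure this could be better.)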
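A possible way to put these pointers together is sketched below. It is not the answer's exact pattern: the lookarounds are simplified to character classes without the trailing + quantifiers, and the names (LookaroundTagger, tagLongestFirst) are hypothetical. It also inherits the weaknesses of the idea: a listed word sitting in the middle of a tagged phrase of three or more words would still be wrapped, and a word that directly touches a literal < or > in the original text is skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LookaroundTagger {
    // Hypothetical helper: longest words first, skipping words that touch a tag.
    static String tagLongestFirst(String text, List<String> words) {
        // Copy so the caller's list is not reordered, then sort longest first.
        List<String> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        for (String word : sorted) {
            // Not preceded by a word character or '>', not followed by a word character or '<'.
            String regex = "(?<![\\w>])" + Pattern.quote(word) + "(?![\\w<])";
            text = text.replaceAll(regex, "<tag>$0</tag>");
        }
        return text;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("foo", "bar", "foo bar");
        System.out.println(tagLongestFirst("foo and bar are different from foo bar.", words));
        // prints: <tag>foo</tag> and <tag>bar</tag> are different from <tag>foo bar</tag>.
    }
}
```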

How to replace a list of strings in a text where some of them are substrings of others?

4 Answers: