Question

I am looking for a way to remove duplicates from a document using Regex or something similar. For example, given the following:

First Line

<Important text /><Important text />Other random words

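I need to remove the duplicate copy of <some text/> and leave everything else unchanged. The text may or may not span multiple lines.

It needs to work for several different words, all of which are wrapped in <> tags.

Edit:

I don't know in advance what the words will be. Some will be nested in <> tags and some will not. I need to remove all the duplicates that repeat one after another, for example: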

<text/><text/><words/><words/><words/>

The output should be:

<text/><words/>

Answer 1

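This regular expression finds the repeated tags: (<.+?\/>)(?=\1). It matches a tag only when an identical copy follows it immediately, so replacing every match with an empty string leaves one tag per run. Here is a Regex 101 demo to prove it.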
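The answer gives only the pattern, so here is a minimal sketch of how it might be applied in C#. The class and method names (LookaheadDedup, RemoveRepeatedTags) are made up for illustration, and the sketch assumes that replacing each match with an empty string is what the answer intends.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadDedup
{
    // (<.+?\/>) captures a self-closing tag (lazily, so it stops at the first "/>").
    // (?=\1)    succeeds only if an identical copy follows immediately;
    //           the lookahead does not consume that copy.
    private static readonly Regex DuplicateTag = new Regex(@"(<.+?\/>)(?=\1)");

    // Deletes every tag that is directly followed by a copy of itself,
    // so only the last tag of each run survives.
    public static string RemoveRepeatedTags(string input)
    {
        return DuplicateTag.Replace(input, "");
    }
}
```

Calling LookaheadDedup.RemoveRepeatedTags on the question's second sample should give <text/><words/>. Note that . does not match a newline by default, so this sketch only handles copies that sit directly next to each other on the same line.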

Answer 2

You can use:

Regex.Replace(input, "(<Important text />)+", "<Important text />");

This replaces any run of one or more consecutive <Important text /> tags with a single <Important text />.

Or, more simply:

Regex.Replace(input, "(<Important text />)+", "$1");

For example:

var input = "<Important text /><Important text />Other random words";
var output = Regex.Replace(input, "(<Important text />)+", "$1");

Console.WriteLine(output); // <Important text />Other random words

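If you want to handle several such patterns at once, use an alternation (|) that lists each of the words you want to handle, together with a backreference (\1) to find the repeats. For example: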

Regex.Replace(input, @"(<(?:Important text|Other text) />)\1+", "$1");
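The edit to the question says the tag names are not known in advance, so the alternation above cannot always be written out. One possible generalization, sketched here under that assumption, captures any self-closing tag and uses the same backreference idea. The names AnyTagDedup and CollapseRepeatedTags are hypothetical, and allowing whitespace between copies is my own addition to cover the "may be on multiple lines" case.

```csharp
using System.Text.RegularExpressions;

public static class AnyTagDedup
{
    // (<[^<>]+/>)  captures one self-closing tag of any name.
    // (?:\s*\1)+   matches one or more further copies of that same tag,
    //              optionally separated by whitespace (including newlines).
    private static readonly Regex RepeatedTag =
        new Regex(@"(<[^<>]+/>)(?:\s*\1)+");

    // Replaces each run of identical tags with a single copy of the tag.
    public static string CollapseRepeatedTags(string input)
    {
        return RepeatedTag.Replace(input, "$1");
    }
}
```

Because the whole run is replaced by $1, any whitespace that separated the copies is dropped along with the duplicates. If that whitespace matters, remove the \s* part.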

Answer 3

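You should build a dictionary of all the tags, i.e. everything from < to />, brackets included, together with their counts (this can be done with a regular expression). Then iterate over the text again, either removing the duplicates or not writing them to a new string or data structure.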
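A rough sketch of how the two passes described above might look in C#. Everything here (DictionaryDedup, CountTags, RemoveDuplicateTags, and the tag pattern itself) is an assumption for illustration, since the answer gives no code. Unlike the regex answers, this approach also drops repeats that are not next to each other, which may or may not be what the question wants.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DictionaryDedup
{
    // Assumed tag shape: "<", anything except angle brackets, then "/>".
    private static readonly Regex Tag = new Regex(@"<[^<>]+/>");

    // First pass: count how often each tag occurs.
    public static Dictionary<string, int> CountTags(string input)
    {
        var counts = new Dictionary<string, int>();
        foreach (Match m in Tag.Matches(input))
        {
            counts.TryGetValue(m.Value, out int n); // n is 0 when the tag is new
            counts[m.Value] = n + 1;
        }
        return counts;
    }

    // Second pass: keep the first occurrence of each tag, drop later ones.
    public static string RemoveDuplicateTags(string input)
    {
        Dictionary<string, int> counts = CountTags(input);
        var emitted = new HashSet<string>();
        return Tag.Replace(input, m =>
        {
            if (counts[m.Value] == 1) return m.Value;   // tag is unique
            return emitted.Add(m.Value) ? m.Value : ""; // Add is false on repeats
        });
    }
}
```

The counts are not strictly needed to drop duplicates (the HashSet alone would do), but they are kept because the answer asks for them and they are handy for reporting which tags were repeated.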

Answer 4

Personally, I don't like using Regex on tags.

Split the text on each tag, remove the duplicates with Distinct, join the result, and you are done.

string input1 = "<Important text /><Important text />Other random words";
string input2 = "<text/><text/><words/><words/><words/>";

string result1 = RemoveDuplicateTags(input1); // "<Important text />Other random words"
string result2 = RemoveDuplicateTags(input2); // "<text/><words/>"

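// Requires: using System.Collections.Generic; using System.Linq;
private string RemoveDuplicateTags(string input)
{
    // Splitting on '>' leaves each tag without its closing bracket,
    // e.g. "<text/" for the tag <text/>.
    IEnumerable<string> tagsOrRandomWords = input.Split('>');

    // Distinct keeps the first occurrence of every piece, in order.
    // It also drops repeats that are not adjacent to each other.
    tagsOrRandomWords = tagsOrRandomWords.Distinct();

    // Re-joining with '>' restores the closing brackets.
    return string.Join(">", tagsOrRandomWords);
}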

Or, if you prefer a less readable one-liner:

private string RemoveDuplicateTags(string input)
{
    return string.Join(">", input.Split('>').Distinct());
}
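One caveat with Distinct is that it removes every later repeat, not only the ones that directly follow each other, so <a/><b/><a/> would become <a/><b/>. If only consecutive repeats should go, as the question's edit suggests, a variation along these lines might work. It is a sketch that keeps the same split-on-'>' idea; the names AdjacentDedup and RemoveAdjacentDuplicateTags are hypothetical.

```csharp
using System.Collections.Generic;

public static class AdjacentDedup
{
    public static string RemoveAdjacentDuplicateTags(string input)
    {
        var kept = new List<string>();
        string previous = null;

        // Each piece is either a tag without its closing '>' or plain text.
        foreach (string piece in input.Split('>'))
        {
            bool isTag = piece.Length > 0 && piece[0] == '<';

            // Skip a tag only when it equals the piece right before it.
            if (isTag && piece == previous) continue;

            kept.Add(piece);
            previous = piece;
        }

        // Re-joining with '>' restores the closing brackets.
        return string.Join(">", kept);
    }
}
```

Like the original, this assumes that '>' never appears in the plain text between tags.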

Removing duplicates with a condition using ASP.NET Regex