Question

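I have a method that uses Regex to find a pattern in a text string. It works, but it isn't adequate, because it requires the text to appear in the exact order rather than treating the phrase as a set of words.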

    public static string HighlightExceptV1(this string text, string wordsToExclude)
    {
        // Original version
        // wordsToExclude usually consists of a 1, 2 or 3 word term.
        // The text must be in a specific order to work.

        var pattern = $@"(\s*\b{wordsToExclude}\b\s*)";

        // Do something to string...
    }

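This version improves on the previous one in that it does allow the words to be matched in any order, but it causes some spacing problems in the final output, because the spaces are removed and replaced with pipes.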

    public static string HighlightExceptV2(this string text, string wordsToExclude)
    {
        // This version allows the words to be matched in any order, but it has
        // flaws, in that the natural spacing is removed in some cases.
        var words = wordsToExclude.Replace(' ', '|');

        var pattern = $@"(\s*\b{words}\b\s*)";

        // Example phrase: big blue widget
        // Example output: $@"(\s*\bbig|blue|widget\b\s*)"

        // Do something to string...
    }

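Ideally, the spacing around each word needs to be preserved. The pseudo-example below shows what I'm trying to do.

1. Split the original phrase into words.
2. Wrap each word in a regex pattern that preserves the whitespace when it matches.
3. Rejoin the word patterns to produce the pattern that will be used for matching.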

    public static string HighlightExceptV3(this string text, string wordsToExclude)
    {
        // The outputted pattern must be dynamic due to unknown
        // words in phrase.

        // Example phrase: big blue widget

        var words = wordsToExclude.Replace(' ', '|');
        // Example: big|blue|widget

        // The code below isn't complete - merely an example
        // of the required output.

        var wordPattern = $@"\s*\b{word}\b\s*";
        // Example: $@"\s*\bwidget\b\s*"

        var phrasePattern = $"({rejoinedArray})";
        // @"(\s*\bbig\b\s*|\s*\bblue\b\s*|\s*\bwidget\b\s*)";

        // Do something to string...
    }

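Note: there may be a better way to handle the word-boundary spacing, but I'm not a regex expert.

I'm looking for some help/advice on splitting the array, wrapping it, and then rejoining it in the most concise way.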

Answer 1

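You need to enclose all the alternatives in a non-capturing group, (?:...|...). In addition, to head off eventual problems, I suggest replacing the word boundaries with their unambiguous lookaround equivalents, (?<!\w)...(?!\w).

Here is a working C# snippet:

    // Requires: using System; using System.Linq; using System.Text.RegularExpressions;
    var text = "there are big widgets in this phrase blue widgets too";
    var words = "big blue widgets";
    var pattern = $@"(\s*(?<!\w)(?:{string.Join("|", words.Split(' ').Select(Regex.Escape))})(?!\w)\s*)";
    var result = string.Concat(Regex.Split(text, pattern, RegexOptions.IgnoreCase).Select((str, index) =>
        index % 2 == 0 && !string.IsNullOrWhiteSpace(str) ? $"<b>{str}</b>" : str));
    Console.WriteLine(result);

Notes

words.Split(' ').Select(Regex.Escape) - splits the words text on spaces and regex-escapes each item
string.Join("|", ...) - rebuilds the string, inserting | between the items
(?<!\w) - a negative lookbehind that matches a position not immediately preceded by a word char; (?!\w) is a negative lookahead that matches a position not immediately followed by a word char.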
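
The pieces of this answer can be folded back into the extension-method shape used in the question. The following is only a minimal sketch, not the answerer's code: the class name HighlightExtensions and the helper BuildExcludePattern are hypothetical, and splitting on any whitespace while dropping empty entries is an assumption (the snippet above splits on a single space). It also assumes the phrase contains at least one word.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HighlightExtensions
{
    // Hypothetical helper: turns "big blue" into (\s*(?<!\w)(?:big|blue)(?!\w)\s*).
    public static string BuildExcludePattern(string wordsToExclude)
    {
        var alternatives = wordsToExclude
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // null separator = split on whitespace
            .Select(Regex.Escape);                                      // escape regex metacharacters in each word
        var joined = string.Join("|", alternatives);
        return $@"(\s*(?<!\w)(?:{joined})(?!\w)\s*)";
    }

    public static string HighlightExceptV3(this string text, string wordsToExclude)
    {
        var pattern = BuildExcludePattern(wordsToExclude);
        // The whole pattern is one capturing group, so Regex.Split returns the
        // unmatched text at even indices and the matched words (with their
        // surrounding whitespace) at odd indices.
        var parts = Regex.Split(text, pattern, RegexOptions.IgnoreCase);
        return string.Concat(parts.Select((part, index) =>
            index % 2 == 0 && !string.IsNullOrWhiteSpace(part) ? $"<b>{part}</b>" : part));
    }
}
```

With this sketch, "there are big widgets in this phrase blue widgets too".HighlightExceptV3("big blue widgets") should yield <b>there are</b> big widgets <b>in this phrase</b> blue widgets <b>too</b>, with the original spacing around the excluded words left intact.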

Answer 2

I suggest using 2 states (inside and outside a selection) and Regex.Replace (we can keep the word as is, word, or replace it with <b>word, word</b> or <b>word</b>).

Code:

    private static string MyModify(string text, string wordsToExclude) {
      HashSet<string> exclude = new HashSet<string>(
        wordsToExclude.Split(' '), StringComparer.OrdinalIgnoreCase);

      bool inSelection = false;

      string result = Regex.Replace(text, @"[\w']+", match => {
          var next = match.NextMatch();

          if (inSelection) {
            if (next.Success && exclude.Contains(next.Value)) {
              inSelection = false;

              return match.Value + "</b>";
            }
            else
              return match.Value;
          }
          else {
            if (exclude.Contains(match.Value))
              return match.Value;
            else if (next.Success && exclude.Contains(next.Value))
              return "<b>" + match.Value + "</b>";
            else {
              inSelection = true;
              return "<b>" + match.Value;
            }
          }
        });

      if (inSelection)
        result += "</b>";

      return result;
    }

Demo:

    string wordsToExclude = "big widgets blue if";

    string[] tests = new string[] {
      "widgets for big blue",
      "big widgets are great but better if blue",
      "blue",
      "great but expensive",
      "big and small, blue and green",
    };

    string report = string.Join(Environment.NewLine, tests
      .Select(test => $"{test,-40} -> {MyModify(test, wordsToExclude)}"));

    Console.Write(report);
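
The state changes in MyModify hinge on peeking at the following word with match.NextMatch() before deciding whether to open or close a <b> tag. As a rough illustration of that peek in isolation (the class NextMatchSketch and the method WordPairs are hypothetical names, not part of the answer), the sketch below pairs each word with the word that follows it:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NextMatchSketch
{
    // Hypothetical helper: returns "word->nextWord" for every word in the text,
    // with an empty right-hand side for the last word.
    public static List<string> WordPairs(string text)
    {
        var pairs = new List<string>();
        // Same word pattern as MyModify: runs of word characters and apostrophes.
        for (Match m = Regex.Match(text, @"[\w']+"); m.Success; m = m.NextMatch())
        {
            Match next = m.NextMatch(); // peek ahead without consuming anything
            pairs.Add(m.Value + "->" + (next.Success ? next.Value : ""));
        }
        return pairs;
    }
}
```

For "big and small, blue" this should give big->and, and->small, small->blue and blue->. The empty successor on the last word corresponds to the case where MyModify leaves the selection open and appends the closing </b> after the replacement has finished.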

Split a string into words, then rejoin them with additional data
