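I have a method that uses Regex to find a pattern in a text string. It works, but it isn't adequate, because it requires the words to appear in the exact order given rather than treating the phrase as a set of words.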
public static string HighlightExceptV1(this string text, string wordsToExclude)
{
// Original version
// wordsToExclude usually consists of a 1, 2 or 3 word term.
// The text must be in a specific order to work.
var pattern = $@"(\s*\b{wordsToExclude}\b\s*)";
// Do something to string...
}
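This version improves on the previous one in that it does allow the words to be matched in any order, but it causes some spacing problems in the final output, because the spaces in the phrase are removed and replaced with pipes.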
public static string HighlightExceptV2(this string text, string wordsToExclude)
{
// This version allows the words to be matched in any order, but it has
// flaws, in that the natural spacing is removed in some cases.
var words = wordsToExclude.Replace(' ', '|');
var pattern = $@"(\s*\b{words}\b\s*)";
// Example phrase: big blue widget
// Example output: $@"(\s*\bbig|blue|widget\b\s*)"
// Do something to string...
}
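One likely contributor to the odd matches, offered as an illustration rather than a diagnosis of the original output: alternation binds more loosely than everything else in the pattern, so the V2 pattern is read as three separate branches (\s*\bbig, blue, and widget\b\s*). The sketch below compares it with a grouped variant. The sample text and the Show helper are assumptions made up for this example, and it is written with C# 9 top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Lists every match of `pattern` in `text`, bracketed so that captured
// whitespace is visible. (Hypothetical helper, not part of the question.)
string Show(string text, string pattern) =>
    string.Join(",", Regex.Matches(text, pattern)
        .Cast<Match>()
        .Select(m => $"[{m.Value}]"));

// Pattern as produced by HighlightExceptV2 for "big blue widget".
// It parses as:  \s*\bbig  |  blue  |  widget\b\s*
var ungrouped = @"(\s*\bbig|blue|widget\b\s*)";

// Same words, with the alternatives grouped so that \b and \s* apply to each word.
var grouped = @"(\s*\b(?:big|blue|widget)\b\s*)";

// Hypothetical sample text chosen to expose the difference.
var sample = "bluebird bigger widget";

Console.WriteLine(Show(sample, ungrouped)); // [blue],[ big],[widget]
Console.WriteLine(Show(sample, grouped));   // [ widget]
```

In the ungrouped form, "blue" matches inside "bluebird" and " big" matches inside "bigger", while only the first branch picks up leading whitespace and only the last one picks up trailing whitespace.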
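Ideally, the spacing around each word needs to be preserved. The pseudo-example below shows what I am trying to do: split the phrase into words, wrap each word in the word pattern, then re-join the word patterns to produce the pattern that will be used for matching.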
public static string HighlightExceptV3(this string text, string wordsToExclude)
{
// The outputted pattern must be dynamic due to unknown
// words in phrase.
// Example phrase: big blue widgets
var words = wordsToExclude.Replace(' ', '|');
// Example: big|blue|widget
// The code below isn't complete - merely an example
// of the required output.
var wordPattern = $@"\s*\b{word}\b\s*";
// Example: $@"\s*\bwidget\b\s*"
var phrasePattern = $"({rejoinedArray})";
// @"(\s*\bbig\b\s*|\s*\bblue\b\s*|\s*\bwidget\b\s*)";
// Do something to string...
}
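Note: there may be a better way to handle the spacing around word boundaries, but I'm no regex expert.
I'm looking for some help/advice on splitting the phrase into an array, wrapping each item, and then re-joining them in the neatest way.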
Answer 0 (score: 2)
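You need to enclose all the alternatives in a non-capturing group, (?:...|...). In addition, to head off further problems, I suggest replacing the word boundaries with their unambiguous lookaround equivalents, (?<!\w)...(?!\w).
Notes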
var text = "there are big widgets in this phrase blue widgets too";
var words = "big blue widgets";
var pattern = $@"(\s*(?<!\w)(?:{string.Join("|", words.Split(' ').Select(Regex.Escape))})(?!\w)\s*)";
var result = string.Concat(Regex.Split(text, pattern, RegexOptions.IgnoreCase).Select((str, index) =>
index % 2 == 0 && !string.IsNullOrWhiteSpace(str) ? $"<b>{str}</b>" : str));
Console.WriteLine(result);
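As a rough sketch of how the snippet above could be packaged for reuse, the same steps can be split into two functions. The names BuildPattern and HighlightExcept are assumptions (the question's methods are extension methods, which need a static class, omitted here to keep the sketch short), the empty-entry handling is an addition, and the code uses C# 9 top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: turns a space-separated phrase into the pattern
// (\s*(?<!\w)(?:word1|word2|...)(?!\w)\s*).
string BuildPattern(string wordsToExclude)
{
    var alternatives = string.Join("|", wordsToExclude
        // Assumption: skip empty entries so repeated spaces do not
        // produce an empty alternative.
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(Regex.Escape));
    return $@"(\s*(?<!\w)(?:{alternatives})(?!\w)\s*)";
}

// Hypothetical wrapper: bolds everything except the excluded words.
string HighlightExcept(string text, string wordsToExclude)
{
    // Because the pattern is one capturing group, Regex.Split keeps the
    // matched words at odd indexes and the text between them at even indexes.
    var parts = Regex.Split(text, BuildPattern(wordsToExclude), RegexOptions.IgnoreCase);
    return string.Concat(parts.Select((part, index) =>
        index % 2 == 0 && !string.IsNullOrWhiteSpace(part) ? $"<b>{part}</b>" : part));
}

Console.WriteLine(BuildPattern("big blue widgets"));
// (\s*(?<!\w)(?:big|blue|widgets)(?!\w)\s*)

Console.WriteLine(HighlightExcept(
    "there are big widgets in this phrase blue widgets too", "big blue widgets"));
// <b>there are</b> big widgets <b>in this phrase</b> blue widgets <b>too</b>
```

The whitespace around each excluded word is part of the captured match, so it is written back unchanged and the original spacing survives.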
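- words.Split(' ').Select(Regex.Escape) splits the words text on spaces and regex-escapes each item.
- string.Join("|", ...) rebuilds the string, inserting | between the items.
- The negative lookbehind (?<!\w) matches a position that is not immediately preceded by a word character, and the negative lookahead (?!\w) matches a position that is not immediately followed by a word character.
Answer 1 (score: 2)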
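I suggest using a small state machine with 2 states (inside and outside a selection) together with Regex.Replace: each word is either kept as it is (word) or replaced with <b>word, word</b>, or <b>word</b>.
Code: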
private static string MyModify(string text, string wordsToExclude) {
HashSet<string> exclude = new HashSet<string>(
wordsToExclude.Split(' '), StringComparer.OrdinalIgnoreCase);
bool inSelection = false;
string result = Regex.Replace(text, @"[\w']+", match => {
var next = match.NextMatch();
if (inSelection) {
if (next.Success && exclude.Contains(next.Value)) {
inSelection = false;
return match.Value + "</b>";
}
else
return match.Value;
}
else {
if (exclude.Contains(match.Value))
return match.Value;
else if (next.Success && exclude.Contains(next.Value))
return "<b>" + match.Value + "</b>";
else {
inSelection = true;
return "<b>" + match.Value;
}
}
});
if (inSelection)
result += "</b>";
return result;
}
Demo:
string wordsToExclude = "big widgets blue if";
string[] tests = new string[] {
"widgets for big blue",
"big widgets are great but better if blue",
"blue",
"great but expensive",
"big and small, blue and green",
};
string report = string.Join(Environment.NewLine, tests
.Select(test => $"{test,-40} -> {MyModify(test, wordsToExclude)}"));
Console.Write(report);