For example, if I have text like this:
first line of text
badger Royal lemon, night trail
light of. Random string of words
that don't match anymore.
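my result must be the run of words in which the last character of each word matches the first character of the next word, even if there are separators in between. In this case: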
badger Royal lemon, night trail
light
What is the simplest way to do this if I want to use a regular expression?
Answer 0 (score: 2)
A regular expression that matches each such sequence of words is:
(?:\b\w+(\w)\b[\W]*(?=\1))*\1\w+
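Depending on your rules about which separators are allowed (periods, semicolons, commas, and so on), you will need to adjust the [\W] part.
Note that this also assumes that a single-letter word breaks the sequence.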
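As a rough illustration of both points, the sketch below wraps the pattern in a hypothetical helper, ChainPattern.Build, whose separatorClass parameter replaces the [\W] part. The helper name and its parameter are made up for this illustration, and the separator class is assumed to be a single regex atom such as \W or [\s,].

```csharp
using System.Text.RegularExpressions;

public static class ChainPattern
{
    // Hypothetical helper (not a library API): builds the chain regex with a
    // caller-supplied separator class in place of [\W].
    // Assumption: separatorClass is a single atom, e.g. @"\W" or @"[\s,]".
    public static Regex Build(string separatorClass = @"\W")
    {
        string pattern =
            @"(?:\b\w+(\w)\b" + separatorClass + @"*(?=\1))*\1\w+";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }
}
```

With the default, ChainPattern.Build().Match("night trail. light").Value is "night trail. light". With ChainPattern.Build(@"[\s,]") the period is no longer an allowed separator, so the match stops at "night trail". ChainPattern.Build().IsMatch("banana a apple") is false, because the single-letter word "a" can satisfy neither \w+(\w) nor \1\w+.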
You can then iterate over each match and keep the longest one:
string longest = null; // longest chain found so far
try {
    Regex regexObj = new Regex(@"(?:\b\w+(\w)\b[\W]*(?=\1))*\1\w+", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    Match matchResults = regexObj.Match(subjectString);
    while (matchResults.Success) {
        // matched text: matchResults.Value
        // match start: matchResults.Index
        // match length: matchResults.Length
        // keep this match if it is longer than the best one so far
        if (longest == null || matchResults.Length > longest.Length) {
            longest = matchResults.Value;
        }
        matchResults = matchResults.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
// (?:\b\w+(\w)\b[\W]*(?=\1))*\1\w+
//
// Options: Case insensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Numbered capture
//
// Match the regular expression below «(?:\b\w+(\w)\b[\W]*(?=\1))*»
//    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Assert position at a word boundary (position preceded or followed, but not both, by a Unicode letter, digit, or underscore) «\b»
//    Match a single character that is a “word character” (Unicode; any letter or ideograph, digit, connector punctuation) «\w+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//    Match the regex below and capture its match into backreference number 1 «(\w)»
//       Match a single character that is a “word character” (Unicode; any letter or ideograph, digit, connector punctuation) «\w»
//    Assert position at a word boundary (position preceded or followed, but not both, by a Unicode letter, digit, or underscore) «\b»
//    Match a single character that is NOT a “word character” (Unicode; any letter or ideograph, digit, connector punctuation) «[\W]*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=\1)»
//       Match the same text that was most recently matched by capturing group number 1 (case insensitive; fail if the group did not participate in the match so far) «\1»
// Match the same text that was most recently matched by capturing group number 1 (case insensitive; fail if the group did not participate in the match so far) «\1»
// Match a single character that is a “word character” (Unicode; any letter or ideograph, digit, connector punctuation) «\w+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
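Answer 1 (score: 0)
I know this is not a regex implementation, but maybe it helps. Here is a simple implementation in C#. It works line by line, comparing the last character of each line with the first character of the next line: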
public static string Process (string s)
{
    var split = s.Split(new[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
    if (split.Length < 2)
        return null; // impossible to find something if the length is not at least two
    string currentString = null;
    string nextString = null;
    for (var i = 0; i < split.Length - 1; i++)
    {
        var str = split[i];
        if (str.Length == 0) continue;
        var lastChar = str[str.Length - 1];
        var nextStr = split[i + 1];
        if (nextStr.Length == 0) continue;
        var nextChar = nextStr[0];
        if (lastChar == nextChar)
        {
            if (currentString == null)
            {
                currentString = str;
                nextString = nextStr.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)[0];
            }
            else
            {
                if (str.Length > currentString.Length)
                {
                    currentString = str;
                    nextString = nextStr.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)[0];
                }
            }
        }
    }
    return currentString == null ? null : currentString + "\n" + nextString;
}
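Since the question asks for chaining from word to word, a word-by-word variant of the same non-regex idea might look like the sketch below. It is not the code above rewritten, just one possible approach. The WordChain class and its Longest method are hypothetical names, and it makes two assumptions of its own: a word is a maximal run of letters or digits, and single-letter words are allowed to take part in a chain (unlike the regex in the first answer).

```csharp
using System.Collections.Generic;

public static class WordChain
{
    // Returns the longest run of consecutive words in which each word's last
    // character equals the next word's first character (ignoring case),
    // including the separators between them, or null if no two words chain.
    public static string Longest(string text)
    {
        // Collect (start, length) of every word.
        // Assumption: a word is a maximal run of letters or digits.
        var words = new List<(int Start, int Length)>();
        int pos = 0;
        while (pos < text.Length)
        {
            if (!char.IsLetterOrDigit(text[pos])) { pos++; continue; }
            int start = pos;
            while (pos < text.Length && char.IsLetterOrDigit(text[pos])) pos++;
            words.Add((start, pos - start));
        }

        string best = null;
        int runStart = 0; // index of the first word in the current chain
        for (int i = 1; i <= words.Count; i++)
        {
            if (i < words.Count)
            {
                char last = text[words[i - 1].Start + words[i - 1].Length - 1];
                char first = text[words[i].Start];
                if (char.ToUpperInvariant(last) == char.ToUpperInvariant(first))
                    continue; // the chain goes on
            }

            // The chain runStart..i-1 has ended; keep it if it has at least
            // two words and is longer than the best one seen so far.
            if (i - 1 > runStart)
            {
                int from = words[runStart].Start;
                int to = words[i - 1].Start + words[i - 1].Length;
                if (best == null || to - from > best.Length)
                    best = text.Substring(from, to - from);
            }
            runStart = i;
        }
        return best;
    }
}
```

For the sample text in the question, WordChain.Longest returns "badger Royal lemon, night trail\nlight", with the original separators and line break preserved.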
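Answer 2 (score: 0)
A regular expression cannot really tell you which match in a string is the longest.
However, using @DeanTaylor's approach, if you match globally you can keep the longest match based on the length of each matched string.
This is a slight variation of his regex, but it works the same way.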
(?:\w*(\w)\W+(?=\1))+\w+
Formatted:
(?:
    \w*
    ( \w )        # (1)
    \W+
    (?= \1 )
)+
\w+
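The "match globally, keep the longest" idea could be sketched as follows. The LongestChain class and its Find method are hypothetical names, and RegexOptions.IgnoreCase is an assumption carried over from the first answer so that "badger" and "Royal" chain.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class LongestChain
{
    // The variant pattern from this answer. IgnoreCase is an assumption
    // taken from the first answer, not something this answer states.
    private static readonly Regex Chain = new Regex(
        @"(?:\w*(\w)\W+(?=\1))+\w+", RegexOptions.IgnoreCase);

    // Returns the longest chain found in the input, or null if there is none.
    public static string Find(string input)
    {
        return Chain.Matches(input)
            .Cast<Match>()                     // MatchCollection -> IEnumerable<Match>
            .OrderByDescending(m => m.Length)  // stable sort: the earliest match wins ties
            .Select(m => m.Value)
            .FirstOrDefault();
    }
}
```

Called on the sample text from the question, LongestChain.Find returns "badger Royal lemon, night trail\nlight". The regex only finds the candidate chains; the length comparison happens in ordinary code.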