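I'm trying to write a highlighting feature. There are two types of highlight: positive and negative. Positive highlighting is applied first. The highlighting itself is very simple: just wrap the keyword/phrase in a span with a specific class, which depends on the type of highlight.
Problem:
Sometimes a negative highlight can contain a positive highlight.
Example
Original text:
some data from blahblah test was not statistically valid
After the text has been "filtered" by positive highlighting, it ends up like this: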
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically valid</span>
or
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>
Then, in the negative list, we have the phrase not statistically valid.
In both cases, the text after passing through both "filters" should look like this:
some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>
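Conditions:
- The number of span tags, and their positions inside a keyword/phrase from the negative "filter" list, are unknown
- The keyword/phrase must be matched even if it contains span tags (including tags directly before and after the keyword/phrase). Those span tags must be removed
- If any span tags are detected, the number of opening and closing span tags removed must be equal.
Questions:
- How can these span tags be detected, if there are any?
- Is this achievable using RegEx alone?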
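Answer 0: (score: 3)
I don't know whether this can be done with a single regular expression, and if it can, to be honest I'm too lazy to work it out.
I came up with a solution that takes 4 steps to reach your goal, one of which is replacing the newly found values with a negative wrapper (<span class="negative">...</span>).
Here is what we have to work with: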
$HTML = <<< HTML
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>
HTML;
$listOfNegatives = ['not statistically valid'];
To extract the words (real words, not tag or attribute names) I used a RegEx that meets our needs at this step:
~\b(?<![</])\w+\b(?![^<>]+>)~
To get the position of each word, preg_match_all() should be called with the flag PREG_OFFSET_CAPTURE.
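As a quick illustration of what that flag changes, here is a minimal sketch on a hypothetical two-word string (the sample text and the simplified `\w+` pattern are assumptions for the demo, not part of the answer's code). Each match comes back as a pair of matched text and byte offset:

```php
<?php
// Hypothetical sample string, not taken from the question.
$text = 'some data';

// PREG_OFFSET_CAPTURE turns every match into a [matched text, byte offset] pair.
preg_match_all('~\w+~', $text, $matches, PREG_OFFSET_CAPTURE);

$pairs = $matches[0];
foreach ($pairs as [$word, $offset]) {
    echo "{$offset} => {$word}\n";
}
// prints:
// 0 => some
// 5 => data
```

The function below relies on exactly this shape: `$word[0]` is the text and `$word[1]` is the offset.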
/**
 * Extract all words and their corresponding positions
 * @param [string] $HTML
 * @return [array] $HTMLWords
 */
function extractWords($HTML) {
    $HTMLWords = [];
    // Match words that are not tag names or attributes; capture offsets too
    preg_match_all("~\b(?<![</])\w+\b(?![^<>]+>)~", $HTML, $words, PREG_OFFSET_CAPTURE);
    foreach ($words[0] as $word) {
        // Key: offset of the word in $HTML, value: the word itself
        $HTMLWords[$word[1]] = $word[0];
    }
    return $HTMLWords;
}
The output of this function looks like this:
Array
(
[0] => some
[5] => data
[10] => from
[38] => blahblah
[47] => test
[59] => was
[63] => not
[90] => statistically
[127] => valid
)
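What we need to do next is match each word of a list value, consecutively, against the words we just extracted. For our first list value, not statistically valid, we have the three words not, statistically and valid, and these words must appear consecutively in the array of extracted words (which they do).
To handle this I wrote a function: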
/**
 * Check if any of our defined list values can be found in an ordered array of extracted words
 * @param [array] $HTMLWords
 * @param [array] $listOfNegatives
 * @return [array] $subStrings
 */
function checkNegativesExistence($HTMLWords, $listOfNegatives) {
    $counter = 0;
    $previousWordOffset = null;
    $subStrings = [];
    foreach ($listOfNegatives as $i => $string) {
        $stringWords = explode(" ", $string);
        $wordIndex = 0;
        foreach ($HTMLWords as $offset => $HTMLWord) {
            // A full phrase was matched: start collecting the next occurrence
            if ($wordIndex > count($stringWords) - 1) {
                $wordIndex = 0;
                $counter++;
            }
            if ($stringWords[$wordIndex] == $HTMLWord) {
                // Store the word, its offset, and the end offset of the previous word
                $subStrings[$counter][] = [$HTMLWord, $offset, $previousWordOffset];
                $wordIndex++;
            } elseif (isset($subStrings[$counter]) && count($subStrings[$counter]) > 0) {
                // The sequence broke off: discard the partial match
                unset($subStrings[$counter]);
                $wordIndex = 0;
            }
            $previousWordOffset = $offset + strlen($HTMLWord);
        }
        $counter++;
    }
    return $subStrings;
}
Its output looks like this:
Array
(
[0] => Array
(
[0] => Array
(
[0] => not
[1] => 63
[2] => 62
)
[1] => Array
(
[0] => statistically
[1] => 90
[2] => 66
)
[2] => Array
(
[0] => valid
[1] => 127
[2] => 103
)
)
)
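As you can see, we have a complete string split into words together with their offsets (there are two offsets per word: the first is the word's actual offset, the second is the offset at which the previous word ends). We'll need them later.
The next thing to do is to replace this occurrence, from offset 62 to offset 127 + strlen(valid), with <span class="negative">not statistically valid</span> and forget about everything else.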
/**
 * Substitute newly matched strings with negative HTML wrapper
 * @param [array] $subStrings
 * @param [string] $HTML
 * @return [string] $HTML
 */
function negativeHighlight($subStrings, $HTML) {
    $offset = 0;
    $HTMLLength = strlen($HTML);
    foreach ($subStrings as $key => $value) {
        $arrayOfWords = [];
        foreach ($value as $word) {
            $arrayOfWords[] = $word[0];
            if (current($value) == $value[0]) {
                $start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
            }
            if (current($value) == end($value)) {
                $defaultLength = $word[1] + strlen($word[0]) - $start;
                $length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
            }
        }
        $string = implode(" ", $arrayOfWords);
        $HTML = substr_replace($HTML, "<span class=\"negative\">{$string}</span>", $start, $length);
        if ($HTMLLength > strlen($HTML)) {
            $offset = -($HTMLLength - strlen($HTML));
        } elseif ($HTMLLength < strlen($HTML)) {
            $offset = strlen($HTML) - $HTMLLength;
        }
    }
    return $HTML;
}
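One important thing to note is that the first substitution may shift the offsets of the other extracted values (there are none here). So the new HTML length has to be taken into account: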
if ($HTMLLength > strlen($HTML)) {
    $offset = -($HTMLLength - strlen($HTML));
} elseif ($HTMLLength < strlen($HTML)) {
    $offset = strlen($HTML) - $HTMLLength;
}
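To see why the offsets drift, here is a minimal sketch on a hypothetical string (the sample text and the `<b>` wrapper are assumptions for the demo). After one replacement, an offset recorded earlier no longer points at the same word, and the difference in length is exactly the correction needed:

```php
<?php
// Hypothetical sample; the offsets are byte offsets into $before.
$before = 'aaa bbb ccc';
$cccOffset = strpos($before, 'ccc');            // 8

// Wrap "bbb" (offset 4, length 3) in a tag, as the highlighter would.
$after = substr_replace($before, '<b>bbb</b>', 4, 3);

// The length difference is how far everything after the replacement moved.
$shift = strlen($after) - strlen($before);      // 7

// The stale offset no longer points at "ccc"; the shifted one does.
var_dump(substr($after, $cccOffset, 3) === 'ccc');          // bool(false)
var_dump(substr($after, $cccOffset + $shift, 3) === 'ccc'); // bool(true)
```

Note that a single signed difference covers both the growing and the shrinking case, so the two branches above compute the same value as `strlen($HTML) - $HTMLLength`.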
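And... we need to check whether this change in length has shifted our offsets.
That check is done by this block (we only need to check the first and the last word):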
if (current($value) == $value[0]) {
    $start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
}
if (current($value) == end($value)) {
    $defaultLength = $word[1] + strlen($word[0]) - $start;
    $length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
}
Putting it all together:
$newHTML = negativeHighlight(checkNegativesExistence(extractWords($HTML), $listOfNegatives), $HTML);
Output:
some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span></span></span>
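But there is a problem with this output: unmatched tags.
I'm sorry, I lied when I said this problem is solved in 4 steps; there is one more. Here I created another RegEx that matches all properly nested tags as well as the tags that were left behind by mistake: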
~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~
With preg_replace_callback() I only replace the tags captured by the group named single, with nothing:
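// Inside a double-quoted PHP string "\3" is read as an octal escape,
// so the backreference has to be written as "\\3".
echo preg_replace_callback("~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~",
    function ($match) {
        // Stray tags end up in the "single" group: drop them
        if (isset($match['single'])) {
            return '';
        }
        // Properly nested spans are kept unchanged
        return $match[1];
    },
    $newHTML
);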
And we have the correct output:
some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>
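As a rough sanity check of the question's third condition, one could count the opening and closing span tags in the result. This is only a sketch: the `$result` and `$intermediate` variables are introduced here for illustration, and equal counts are a necessary sign of balanced markup, not proof that the nesting is correct.

```php
<?php
// Final string from the answer above.
$result = 'some data from <span class="positive">blahblah test</span> was '
        . '<span class="negative">not statistically valid</span>';

// Intermediate string from before the cleanup step, with two stray closing tags.
$intermediate = $result . '</span></span>';

// Returns true when the number of opening and closing span tags is equal.
function spansAreBalanced(string $html): bool {
    $open  = preg_match_all('~<span\b[^>]*>~', $html);
    $close = substr_count($html, '</span>');
    return $open === $close;
}

var_dump(spansAreBalanced($result));       // bool(true)
var_dump(spansAreBalanced($intermediate)); // bool(false)
```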
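My solution fails to output correct HTML in the following cases:
1- If a word wrapped in <>, such as <was>, sits between the other words:
<span class="positive">blahblah test</span> <was> not
Why?
<was> is treated as an unmatched tag, so it will be removed.
2- If a word such as not (which is part of a value in the negative list) is wrapped in <> -> <not>. That outputs: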
some data from <span class="positive">blahblah test</span> was <not> <span class="positive">statistically <span class="positive">valid</span></span>
Why?
A word wrapped in <> looks like a tag, so the word-extraction RegEx skips it.
3- If one value in the list is a substring of another:
$listOfNegatives = ['not statistically valid', 'not statistically'];
Why?
Working demo
Answer 1: (score: 1)
This is what I came up with. Honestly, I can't say whether it meets all the requirements, but it may help.
$s = 'some data from blahblah test was not statistically valid';
$replaced = highlight($s);
var_dump($replaced);

function highlight($s) {
    // split the string on the negative parts, capturing the full negative string each time
    $parts = preg_split('/(not statistically valid)/', $s, -1, PREG_SPLIT_DELIM_CAPTURE);
    $output = '';
    $negativePart = 0; // keep track of whether we're dealing with a negative part or the remainder - they will alternate.
    foreach ($parts as $part) {
        if ($negativePart) {
            $output .= negativeHighlight($part);
        } else {
            $output .= positiveHighlight($part);
        }
        $negativePart = !$negativePart;
    }
    return $output;
}

// only deals with a single negative part at a time, so just wraps with a span
function negativeHighlight($part) {
    return "<span class='negative'>$part</span>";
}

// potentially deals with several replacements at once
function positiveHighlight($part) {
    // a single capture group, so $1 holds whichever phrase matched
    return preg_replace('/(blahblah test|statistically valid)/', "<span class='positive'>$1</span>", $part);
}