Given the following text:
<p style="color: blue">Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p> // Should match
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p> // Should match
<p style="margin-left: 20px">* Sub Item 2a</p>
<p style="margin-left: 10px">* Item 3</p>
<p style="margin-left: 20px">* Sub Item 1b</p> // Should match
<p style="margin-left: 20px">* Sub Item 2b</p>
<p style="margin-left: 30px">* Sub Item 1c</p> // Should match
<p>Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p> // Should match
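I am trying to find any p elements that meet the following criteria:
- The element has a margin-left inline style
- The preceding element is one of:
  - not a p element
  - a p element with no margin-left
  - a p element whose margin-left is lower than the matched element's
So in this example, I need to match the following elements: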
<p style="color:blue; margin-left: 10px">* Item 1</p> (preceding element is a p but doesn't have any margin-left)
<p style="margin-left: 20px">* Sub Item 1a</p> (preceding element is a p but has a different margin-left value)
<p style="margin-left: 20px">* Sub Item 1b</p> (preceding element is a p but has a different margin-left value)
<p style="margin-left: 30px">* Sub Item 1c</p> (preceding element is a p but has a margin-left value lower than the current matched element)
<p style="color:blue; margin-left: 10px">* Item 1</p> (preceding element is a p but has no margin-left value)
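The criteria above can be written as a small predicate. This is only a sketch: the function name `isListStart` and its parameters are hypothetical, and it assumes the margin values have already been extracted as integers, with `null` standing for "not a p element" or "no margin-left".

```php
<?php
// Hypothetical helper (not part of the original post): decides whether an
// element starts a new list or sub-list.
// $prevMargin is null when the preceding element is not a p or has no margin-left.
// $margin is null when the current element has no margin-left.
function isListStart(?int $prevMargin, ?int $margin): bool
{
    if ($margin === null) {
        return false; // the element itself must carry a margin-left
    }
    // Match when there is nothing to compare against, or the indent grew
    return $prevMargin === null || $prevMargin < $margin;
}

var_dump(isListStart(null, 10)); // bool(true)
var_dump(isListStart(10, 20));   // bool(true)
var_dump(isListStart(20, 20));   // bool(false)
var_dump(isListStart(20, 10));   // bool(false)
```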
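I can't use DomDocument because the markup I receive is not always valid (it usually comes from Microsoft Office > HTML conversions), so I'm approaching the problem with regular expressions.
My current regex is: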
(?!<p style=".*?(margin-left:\s?(?!\k'margin')px;).*?">\* .*?<\/p>)<p style="(?P<styles>.*?)margin-left:\s?(?P<margin>[0-9]{1,3})px;?">\* (?P<listcontent>.*)<\/p>
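But this only matches on the preceding element being a p with a margin-left.
How can I evaluate the matched margin group and return a match only when its value is greater than the previous match's?
I've created an online regex demo to illustrate the problem, including sample data and my current output.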
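Answer 0 (score: 0)
This code works as expected: it uses a regular expression to grab every element, then loops over them and checks the business logic: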
<?php
$data = '<p style="color: blue">Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p>
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p>
<p style="margin-left: 20px">* Sub Item 2a</p>
<p style="margin-left: 10px">* Item 3</p>
<p style="margin-left: 20px">* Sub Item 1b</p>
<p style="margin-left: 20px">* Sub Item 2b</p>
<p style="margin-left: 30px">* Sub Item 1c</p>
<div>Some text</div>
<p style="color:blue; margin-left: 10px">* Item 1</p>';
// Get all HTML tags: the element name in [1], the attributes (style etc.) in [2], the content in [3]
preg_match_all("/<(\w+)\b([^>]*)>(.*?)<\/\w+>/", $data, $matches);
$results = [];
// Keep track of the last element's margin-left. If it is missing it is set to 0, so the next
// element is included automatically if it has a margin-left
$lastMarginLeft = 0;
// Loop through matches and apply business rules
$count = count($matches[0]);
for ($i = 0; $i < $count; $i++) {
    /**
     * Business rules:
     * - Contents begin with an asterisk character
     * - Elements have a margin-left inline style
     * - The preceding content is either:
     *   - A p element which has no margin-left
     *   - A p element with a margin-left which is lower than the matched element
     *   - Any other element
     */
    // Assume no margin-left found by default
    $marginLeft = 0;
    // Check element has a margin-left
    if (strpos($matches[2][$i], 'margin-left') !== false) {
        // Extract margin-left value as an integer
        preg_match("/margin-left:\s?(\d+)/", $matches[2][$i], $value);
        $marginLeft = isset($value[1]) ? (int) $value[1] : 0;
        // Check if this margin is greater than the last
        if ($marginLeft > $lastMarginLeft) {
            // Check content starts with an asterisk
            if (strpos($matches[3][$i], '*') === 0) {
                $results[] = $matches[0][$i];
            }
        }
    }
    // Capture margin-left for the next iteration
    $lastMarginLeft = $marginLeft;
}
print_r($results);
// Results:
// Array
// (
// [0] => <p style="color:blue; margin-left: 10px">* Item 1</p>
// [1] => <p style="margin-left: 20px">* Sub Item 1a</p>
// [2] => <p style="margin-left: 20px">* Sub Item 1b</p>
// [3] => <p style="margin-left: 30px">* Sub Item 1c</p>
// [4] => <p style="color:blue; margin-left: 10px">* Item 1</p>
// )
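For reuse, the same loop could be wrapped in a function. The following is a minimal sketch, not the answer's original code: the name `findListStarts` is hypothetical, `PREG_SET_ORDER` is used so each element arrives as one array, and the closing tag is tied to the opening one with a `\1` backreference. It also assumes that only p elements count as list items and that same-name tags are not nested, which may not hold for all Office-generated markup.

```php
<?php
// Hypothetical wrapper around the loop above; all names are illustrative.
function findListStarts(string $html): array
{
    // One entry per element: [0] full tag, [1] tag name, [2] attributes, [3] content
    preg_match_all('/<(\w+)\b([^>]*)>(.*?)<\/\1>/s', $html, $elements, PREG_SET_ORDER);

    $results = [];
    $lastMarginLeft = 0; // 0 means "no margin-left on the preceding element"

    foreach ($elements as [$tag, $name, $attributes, $content]) {
        $marginLeft = 0;
        // Only p elements with an inline margin-left carry an indent level
        if (strtolower($name) === 'p'
            && preg_match('/margin-left:\s*(\d+)px/i', $attributes, $m)) {
            $marginLeft = (int) $m[1];
        }
        // Keep the element when the indent grew and the content starts with "*"
        if ($marginLeft > $lastMarginLeft && strpos($content, '*') === 0) {
            $results[] = $tag;
        }
        $lastMarginLeft = $marginLeft;
    }

    return $results;
}

$sample = '<p>Intro</p>
<p style="margin-left: 10px">* Item 1</p>
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p>';

print_r(findListStarts($sample));
```

With this sample, the function should return the `* Item 1` and `* Sub Item 1a` paragraphs, since those are the only two whose margin-left is greater than that of the element before them.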