Find matches where a captured group is greater than in the previous match, using regex

Time: 2017-06-11 10:26:14

Tags: php regex pcre regex-negation regex-lookarounds

Given the following text:

<p style="color: blue">Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p> // Should match
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p> // Should match
<p style="margin-left: 20px">* Sub Item 2a</p>
<p style="margin-left: 10px">* Item 3</p>
<p style="margin-left: 20px">* Sub Item 1b</p> // Should match
<p style="margin-left: 20px">* Sub Item 2b</p>
<p style="margin-left: 30px">* Sub Item 1c</p> // Should match
<p>Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p> // Should match

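I am trying to find any p elements which meet the following criteria:

  • The content begins with an asterisk
  • They have a margin-left inline style
  • The preceding content is either:
    • A p element which has no margin-left
    • A p element with a margin-left lower than that of the matched element
    • Any other element

So in this example, I need to match the following elements: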

<p style="color:blue; margin-left: 10px">* Item 1</p> (preceding element is a p but doesn't have any margin-left)
<p style="margin-left: 20px">* Sub Item 1a</p> (preceding element is a p but has a different margin-left value)
<p style="margin-left: 20px">* Sub Item 1b</p> (preceding element is a p but has a different margin-left value)
<p style="margin-left: 30px">* Sub Item 1c</p> (preceding element is a p but has a margin-left value lower than the current matched element)
<p style="color:blue; margin-left: 10px">* Item 1</p> (preceding element is a p but has no margin-left value)

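I can't use DomDocument because the markup I receive is not always valid (it often comes from Microsoft Office to HTML conversions), so I am using regex to solve the problem.

My current regex is: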

(?!<p style=".*?(margin-left:\s?(?!\k'margin')px;).*?">\* .*?<\/p>)<p style="(?P<styles>.*?)margin-left:\s?(?P<margin>[0-9]{1,3})px;?">\* (?P<listcontent>.*)<\/p>

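But this only matches based on whether the preceding element is a p with a margin-left.

How can I evaluate the matched margin-left group and only return matches whose value is greater than that of the previous match?

I've created an online regex to demonstrate the problem, including the sample data and my current output.

1 Answer:

Answer 0: (score: 0)

This code works as expected. It uses a regular expression to grab each element, then loops over them and checks the business logic: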

<?php

$data = '<p style="color: blue">Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p>
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p>
<p style="margin-left: 20px">* Sub Item 2a</p>
<p style="margin-left: 10px">* Item 3</p>
<p style="margin-left: 20px">* Sub Item 1b</p>
<p style="margin-left: 20px">* Sub Item 2b</p>
<p style="margin-left: 30px">* Sub Item 1c</p>
<div>Some text</div>
<p style="color:blue; margin-left: 10px">* Item 1</p>';

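// Get all HTML tags: the element name in [1], the attributes (style etc.) in [2], the content in [3].
// [^>]* replaces the nested quantifier ([^>]+)*, which can backtrack heavily on malformed tags.
preg_match_all("/<(\w+)\b([^>]*)>(.*?)<\/\w+>/", $data, $matches);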

$results = [];

// Keep track of the last element's margin-left. If it is missing it is set to 0, so the next
// element is included automatically if it has a margin-left
$lastMarginLeft = 0;

// Loop through matches and apply business rules
for ($i = 0; $i < count($matches[0]); $i++) {
    /**
     * Business rules:
     * - Contents begins with an asterisk character
     * - Elements have a margin-left inline style
     * - The preceding content is either:
     *   - A p element which has no margin-left
     *   - A p element with a margin-left which is lower than the matched element
     *   - Any other element
     */

    // Assume no margin-left found by default
    $marginLeft = 0;

    // Check element has a margin-left
    if (strpos($matches[2][$i], 'margin-left') !== false) {
        // Extract margin-left value
        preg_match("/margin-left:\s?(\d+)/", $matches[2][$i], $value);
        $marginLeft = isset($value[1]) ? (int) $value[1] : 0;

        // Check if this margin is greater than the last
        if ($marginLeft > $lastMarginLeft) {
            // Check content
            if (strpos($matches[3][$i], '*') === 0) {
                $results[] = $matches[0][$i];
            }
        }
    }

    // Capture margin left for next run
    $lastMarginLeft = $marginLeft;
}

print_r($results);

// Results:
// Array
// (
//     [0] => <p style="color:blue; margin-left: 10px">* Item 1</p>
//     [1] => <p style="margin-left: 20px">* Sub Item 1a</p>
//     [2] => <p style="margin-left: 20px">* Sub Item 1b</p>
//     [3] => <p style="margin-left: 30px">* Sub Item 1c</p>
//     [4] => <p style="color:blue; margin-left: 10px">* Item 1</p>
// )
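
As a variation on the code above, the same rules can be wrapped in a reusable function. This is only a sketch: the function name `findListStarts` and the named groups `tag`, `attrs`, and `content` are my own choices, not part of the original code. It assumes flat markup like the sample (no elements nested inside the `<p>`), and unlike the code above it only tracks margin-left on `p` elements, so any other element resets the comparison value to 0.

```php
<?php

/**
 * Hypothetical helper: returns the <p> elements that begin with an asterisk
 * and have a larger margin-left than the element immediately before them.
 */
function findListStarts(string $html): array
{
    // Named groups: tag name, raw attribute string, inner content.
    // (?P=tag) requires the closing tag to match the opening tag name.
    $pattern = '/<(?P<tag>\w+)\b(?P<attrs>[^>]*)>(?P<content>.*?)<\/(?P=tag)>/s';
    preg_match_all($pattern, $html, $elements, PREG_SET_ORDER);

    $results = [];
    $lastMarginLeft = 0; // margin-left of the previous element, 0 if it had none

    foreach ($elements as $element) {
        $marginLeft = 0;

        // Only <p> elements with a margin-left are candidates;
        // everything else leaves $marginLeft at 0.
        if (strtolower($element['tag']) === 'p'
            && preg_match('/margin-left:\s*(\d+)/i', $element['attrs'], $m)
        ) {
            $marginLeft = (int) $m[1];

            if ($marginLeft > $lastMarginLeft
                && strpos(ltrim($element['content']), '*') === 0
            ) {
                $results[] = $element[0];
            }
        }

        // Remember this element's margin for the next iteration.
        $lastMarginLeft = $marginLeft;
    }

    return $results;
}

$sample = '<p>Intro</p>
<p style="margin-left: 10px">* Item 1</p>
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p>';

$starts = findListStarts($sample);
print_r($starts);
```

With `PREG_SET_ORDER`, each entry of `$elements` holds one complete element, so the loop can use `foreach` instead of indexing three parallel arrays.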