Get the content of a tag if a given term follows it, using DOMDocument

Date: 2018-08-20 13:19:28

Tags: php xpath domdocument

I have this $html:

$html = '<p>random</p>
<a href="">Test 1</a> (target1)
<br>
<a href="">Test 2</a>  (target1)
<br>
<a href="">Test 3</a> (skip)
// etc
';

I have a few terms in $array:

$array = array(
    '(target1)',
    '(target2)'
);

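How can I use DOMDocument to walk through $html, find every term from $array, and get the content of the <a> tag that precedes it?

So that I end up with the following result: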

$results = array(
    array(
        'text' => 'Test 1',
        'needle' => 'target1'
    ),
    array(
        'text' => 'Test 2',
        'needle' => 'target1'
    )
);

What I have tried so far

With the following, I managed to get the content of all <a> tags in $html:

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
$xpath = new DOMXPath($doc);

$elements = $xpath->query('//a'); 
$el_array = array();
if ($elements->length > 0) {
    foreach($elements as $n) {
        $node = trim(strip_tags($n->nodeValue));
        if (!empty($node)) {
            $el_array[] = $node;
        }
    }
    if (!empty($el_array) && is_array($el_array)) {
        print_r($el_array);
    }
}

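But I have not found a way to get the target term, so that I can check whether there is a match.

3 answers:

Answer 0: (score: 3)

You can build a dynamic XPath query using contains and following-sibling.

The XPath expression would be: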

//a/following-sibling::text()[contains(., '(target1)') or contains(., '(target2)')]

For example:

$array = array(
    '(target1)',
    '(target2)'
);

$contains =  implode(" or ", array_map(function($x) {
    return "contains(., '$x')";
}, $array));

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//a/following-sibling::text()[$contains]");
$results = [];

foreach ($elements as $element) {
    $results[] = [$element->previousSibling->nodeValue, trim($element->nodeValue)];
}

print_r($results);

Result:

Array
(
    [0] => Array
        (
            [0] => Test 1
            [1] => (target1)
        )

    [1] => Array
        (
            [0] => Test 2
            [1] => (target1)
        )

)
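The query above returns the anchor text and the raw term as a plain pair. As a rough sketch of how the 'text' / 'needle' shape requested in the question could be produced, the variant below swaps the contains() predicate for an exact comparison against the node that directly follows each <a>. Stripping the parentheses from the needle is an assumption based on the desired output shown in the question, and the sample $html is copied from there.

```php
<?php
// Sample input copied from the question.
$html = '<p>random</p>
<a href="">Test 1</a> (target1)
<br>
<a href="">Test 2</a>  (target1)
<br>
<a href="">Test 3</a> (skip)
';

$array = array('(target1)', '(target2)');

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
$xpath = new DOMXPath($doc);

$results = array();

// For each <a>, inspect only the node that immediately follows it.
foreach ($xpath->query('//a') as $a) {
    $next = $a->nextSibling;
    if (!($next instanceof DOMText)) {
        continue; // nothing textual directly after this link
    }
    $label = trim($next->nodeValue);
    if (in_array($label, $array, true)) {
        $results[] = array(
            'text'   => trim($a->nodeValue),
            // Assumption: the question wants the needle without parentheses.
            'needle' => trim($label, '()'),
        );
    }
}

print_r($results);
```

Because the comparison is exact rather than substring-based, a text node such as "(target1) and more" would not match here; whether that is desirable depends on the real markup.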


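Answer 1: (score: 1)

You can traverse the parsed DOM, saving the value each time you encounter an anchor, and then check whether the current node's value is in the array (target1, target2). If it is, store the current node and the last anchor text in $result.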

<?php
    $html = '<p>random</p>
    <a href="">Test 1</a> (target1)
    <br>
    <a href="">Test 2</a>  (target1)
    <br>
    <a href="">Test 3</a> (skip)
    // etc
    ';

    $array = array(
        '(target1)',
        '(target2)'
    );

    $result = array();
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    // Walk the whole document; matches are collected in the global $result.
    showDOMNode($doc, $array);
    print_r($result);

    function showDOMNode(DOMNode $domNode,$array,$oldval=false) {
        global  $result;
        foreach ($domNode->childNodes as $node){
            $nodename = $node->nodeName;
            $nodevalue = $node->nodeValue;
            if($nodename == "a"){
                $oldval = $nodevalue;
            }
            if(in_array(trim($nodevalue),$array)){
                // Store the trimmed needle so no surrounding whitespace is kept.
                $result[] = array(
                    "text" => $oldval,
                    "needle" => trim($nodevalue)
                );
            }
            if($node->hasChildNodes()) {
                showDOMNode($node,$array,$oldval);
            }
        }    
    }

It outputs:

Array ( 
[0] => Array ( [text] => Test 1 [needle] => (target1) ) 
[1] => Array ( [text] => Test 2 [needle] => (target1) ) 
) 
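The same traversal can probably be written without a global variable. The sketch below is one possible variant, not the answerer's code: collectMatches is a hypothetical helper name, and passing the last anchor text by reference is a design choice made here so that it survives across recursion levels.

```php
<?php
// Sample input copied from the question.
$html = '<p>random</p>
<a href="">Test 1</a> (target1)
<br>
<a href="">Test 2</a>  (target1)
<br>
<a href="">Test 3</a> (skip)
';

$array = array('(target1)', '(target2)');

// Hypothetical helper: returns the matches instead of writing to a global.
function collectMatches(DOMNode $node, array $needles, &$lastAnchor = null)
{
    $found = array();
    foreach ($node->childNodes as $child) {
        // Remember the text of the most recent <a> element.
        if ($child->nodeName === 'a') {
            $lastAnchor = trim($child->nodeValue);
        }
        // Only text nodes are compared against the list of terms.
        if ($child instanceof DOMText) {
            $value = trim($child->nodeValue);
            if (in_array($value, $needles, true)) {
                $found[] = array('text' => $lastAnchor, 'needle' => $value);
            }
        }
        // Descend into elements and merge whatever they report.
        if ($child->hasChildNodes()) {
            $found = array_merge($found, collectMatches($child, $needles, $lastAnchor));
        }
    }
    return $found;
}

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);

$result = collectMatches($doc, $array);
print_r($result);
```

Returning the array keeps the function reusable for several documents in one script, since no shared state has to be reset between calls.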

Answer 2: (score: 0)

Sorry, I could not come up with the DOMDocument solution you need :/

I think this should do it:

$html = '
<p>random</p>
<a href="page1.php">Test 1</a> (target1)
<br>
<a href="page2.htm">Test 2</a>  (target1)
<br>
<a href="page3.html">Test 3</a> (skip)
// etc
';

$array = array(
    '(target1)',
    '(target2)'
);

#Explode HTML into new lines, to run through each line

$lines  = explode("\n", $html);

foreach ($lines as $line){

    #Find pattern from $array, and if match, use preg_match_all to find the text in the a-tag
    if(str_replace($array, '', $line) != $line){
        // [^"]* keeps the href match from running past its closing quote.
        preg_match_all('/<a href="[^"]*">(.*?)<\/a>/s', $line, $matches);

        print_r($matches[1]);
    }
}

Output

Array
(
    [0] => Test 1
)
Array
(
    [0] => Test 2
)
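The output above contains only the link text. If the matched term is wanted as well, a single pattern could capture both. This is only a sketch under two assumptions: the link text contains no nested tags, and the term follows the closing </a> with nothing but whitespace in between. The sample markup is the one from this answer.

```php
<?php
$html = '
<p>random</p>
<a href="page1.php">Test 1</a> (target1)
<br>
<a href="page2.htm">Test 2</a>  (target1)
<br>
<a href="page3.html">Test 3</a> (skip)
';

$array = array('(target1)', '(target2)');

// Escape each term for use in a regex and join them: \(target1\)|\(target2\)
$terms = implode('|', array_map(function ($term) {
    return preg_quote($term, '/');
}, $array));

// Group 1: link text (no nested tags assumed). Group 2: the term after </a>.
$pattern = '/<a\b[^>]*>([^<]*)<\/a>\s*(' . $terms . ')/';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$results = array();
foreach ($matches as $match) {
    $results[] = array('text' => $match[1], 'needle' => $match[2]);
}

print_r($results);
```

As with any regex over HTML, this is fragile once the markup varies, so the DOMDocument approaches are likely the safer choice for real pages.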