How can I extract only the link elements (skipping the span elements) and store them as comma-separated strings in an array using Symfony DomCrawler?

Date: 2017-05-29 15:22:56

Tags: php arrays symfony xpath domcrawler

<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <span>goutte</span>, <a href="/en/html/">html</a>, <span>dom crawler</span>, <a href="/en/form/">form</a><span>guzzle</span>, <span>web scrapper</span>
</span>
<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <span>goutte</span>, <a href="/en/elequent/">elequent</a>, <span>dom crawler</span>, <span>guzzle</span>, <a href="/en/orm/">orm</a>, <span>web scrapper</span>
</span>
<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <a href="/en/goutte">goutte</a>, <a href="/en/php/">php</a>, <span>dom crawler</span>, <a href="/en/guzzle">guzzle</a>, <a href="/en/web-scrapper">web scrapper</a>
</span>

I want to extract only the link text from each block into an array like this:

array (size=3)
  0 => string 'laravel, html, form' (length=19)
  1 => string 'laravel, elequent, orm' (length=22)
  2 => string 'laravel, goutte, php, guzzle, web scrapper' (length=42)

1 Answer:

Answer 0 (score: 1)

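Try this code snippet. It uses PHP's built-in DOMDocument and DOMXPath rather than DomCrawler: it selects each span with class "tl", collects the text of the a elements inside it, and joins them with ", ".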

<?php
ini_set('display_errors', 1);

$string=<<<HTML

<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <span>goutte</span>, <a href="/en/html/">html</a>, <span>dom crawler</span>, <a href="/en/form/">form</a><span>guzzle</span>, <span>web scrapper</span>
</span>
<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <span>goutte</span>, <a href="/en/elequent/">elequent</a>, <span>dom crawler</span>, <span>guzzle</span>, <a href="/en/orm/">orm</a>, <span>web scrapper</span>
</span>
<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <a href="/en/goutte">goutte</a>, <a href="/en/php/">php</a>, <span>dom crawler</span>, <a href="/en/guzzle">guzzle</a>, <a href="/en/web-scrapper">web scrapper</a>
</span>

HTML;

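// Parse the HTML fragment into a DOM tree.
$domDocument = new DOMDocument();
$domDocument->loadHTML($string);

// Select every outer <span class="tl"> block.
$domXPath = new DOMXPath($domDocument);
$results = $domXPath->query('//span[@class="tl"]');

$data = array();
foreach ($results as $result) {
    $tempArray = array();

    // ".//a" is evaluated relative to the current block, so only the
    // links inside it match; the nested <span> elements are skipped.
    foreach ($domXPath->query('.//a', $result) as $aNode) {
        $tempArray[] = $aNode->nodeValue;
    }

    // One comma-separated string per block.
    $data[] = implode(', ', $tempArray);
}
print_r($data);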
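
Since the question asks about Symfony DomCrawler specifically, here is a rough sketch of the same idea using the Crawler class. It is not part of the original answer. It assumes symfony/dom-crawler is installed through Composer and that the autoloader is at vendor/autoload.php (adjust the path for your project). It uses filterXPath() rather than filter(), so the separate symfony/css-selector package should not be needed.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer and the
// autoloader lives here; change the path to match your project layout.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<HTML
<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <span>goutte</span>, <a href="/en/html/">html</a>, <span>dom crawler</span>, <a href="/en/form/">form</a><span>guzzle</span>, <span>web scrapper</span>
</span>
<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <span>goutte</span>, <a href="/en/elequent/">elequent</a>, <span>dom crawler</span>, <span>guzzle</span>, <a href="/en/orm/">orm</a>, <span>web scrapper</span>
</span>
<span class="tl">
<a href="/en/laravel/" class="c">laravel</a>, <a href="/en/goutte">goutte</a>, <a href="/en/php/">php</a>, <span>dom crawler</span>, <a href="/en/guzzle">guzzle</a>, <a href="/en/web-scrapper">web scrapper</a>
</span>
HTML;

$crawler = new Crawler($html);

// each() runs the closure once per matched node and returns the
// closure results as a plain array, one entry per <span class="tl">.
$data = $crawler->filterXPath('//span[@class="tl"]')->each(function (Crawler $span) {
    // Inside each(), the XPath is evaluated against the current span,
    // so this picks up only the <a> elements nested in this block.
    $labels = $span->filterXPath('//a')->each(function (Crawler $link) {
        return trim($link->text());
    });

    // One comma-separated string per block.
    return implode(', ', $labels);
});

print_r($data);
```

With the markup from the question, $data should hold the three strings 'laravel, html, form', 'laravel, elequent, orm' and 'laravel, goutte, php, guzzle, web scrapper'.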