Question

OK, the page I'm trying to scrape has the following structure:

<span id="1">
    <a href="https://example.com">+</a>
    <span title="1">DATA HERE</span>
    <a href="https://example.com">DATA HERE</a> 
    <a href="https://example.com">DATA HERE</a>
</span>
<span id="2">
    <a href="https://example.com">+</a>
    <span title="1">DATA HERE</span>
    <a href="https://example.com">DATA HERE</a> 
    <a href="https://example.com">DATA HERE</a>
</span>

There are 128 records on the page (spans with an id value).

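I'm using the code below. It saves the data fine, but I need the href attribute value of each a element separated by a comma until it reaches the last one inside the id span, and then I need to use PHP_EOL to move to a new line.

Please help, I'm pulling my hair out.

Code: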

do {
    foreach($doc->getElementsByTagName('span') as $element ) { 

        if (!empty($element->getAttribute('id'))){

            foreach($doc->getElementsByTagName('a') as $ahref ) {

                if ($ahref->hasAttribute('href')) { 
                    $filename = 'test2/'.$f.'.txt';
                    $file = fopen($filename,"a");

                    $data = $ahref->getAttribute('href').',';
                    fwrite($file,$data);
                    fclose($file);
                }
            }
        }
    }
}

Answer 1

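Here is some code using DomDocument and DomXPath that I think will give you the result you want. It finds all the spans that have an id attribute, then iterates over their children looking for a elements. When it finds one, it adds its href attribute to the list of hrefs for that span. Once all the children of a span have been processed, it outputs the list of hrefs as a single comma-separated line.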

$html = '<span id="1">
    <a href="https://example.com">+</a>
    <span title="1">DATA HERE</span>
    <a href="https://example.com">DATA HERE</a> 
    <a href="https://example.com">DATA HERE</a>
</span>
<span id="2">
    <a href="https://example.com">+</a>
    <span title="1">DATA HERE</span>
    <a href="https://example.com">DATA HERE</a> 
    <a href="https://example.com">DATA HERE</a>
</span>';
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$spans = $xpath->query("//span[@id]");
foreach ($spans as $span) {
    $hrefs = array();
    foreach ($span->childNodes as $n) {
        if ($n->nodeName == 'a') {
            $hrefs[] = $n->attributes->getNamedItem('href')->nodeValue;
        }
    }
    echo implode(',', $hrefs) . "\n";
}

Output:

https://example.com,https://example.com,https://example.com 
https://example.com,https://example.com,https://example.com
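
Since the question writes to a file rather than echoing, here is a rough sketch of the same idea that collects one line per span and writes them all at once with PHP_EOL. The function name hrefLinesPerSpan, the sample URLs, and the temp-directory path are my own placeholders, not anything from the question. Note that the inner query is run relative to the current span; the question's inner loop calls getElementsByTagName('a') on $doc, which returns every link in the document for each span.

```php
<?php
// Hypothetical helper: returns one comma-separated line of hrefs
// for each <span> that has an id attribute.
function hrefLinesPerSpan(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them,
    // since real-world pages are rarely perfectly valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $lines = [];
    foreach ($xpath->query('//span[@id]') as $span) {
        $hrefs = [];
        // "./a[@href]" is evaluated relative to $span, so only the
        // direct child links of this span are matched.
        foreach ($xpath->query('./a[@href]', $span) as $a) {
            $hrefs[] = $a->getAttribute('href');
        }
        $lines[] = implode(',', $hrefs);
    }
    return $lines;
}

// Sample markup with made-up URLs so the two lines are distinguishable.
$html = '<span id="1"><a href="https://example.com/a">+</a>'
      . '<span title="1">DATA</span>'
      . '<a href="https://example.com/b">DATA</a></span>'
      . '<span id="2"><a href="https://example.com/c">+</a></span>';

$lines = hrefLinesPerSpan($html);

// Assumed output path; swap in 'test2/' . $f . '.txt' from the question.
$filename = sys_get_temp_dir() . '/hrefs.txt';
file_put_contents($filename, implode(PHP_EOL, $lines) . PHP_EOL);
```

Building the lines in memory and writing once also avoids opening and closing the file for every single link. If you need to append across several pages, passing the FILE_APPEND flag to file_put_contents should do it.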
