Question

I am using a PHP scraper to do the following:

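1. cURL multiple (always fewer than 10) URLs,
2. add the HTML from each URL to a DOMDocument,
3. parse that DOM document for <a> elements that link to PDFs,
4. store the href of each matching element in an array.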

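I have steps 1 & 2 down (my code outputs the combined HTML of all the URLs), but when I try to iterate through the result to find the elements that link to PDFs, I get nothing (an empty array).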

I have tried my parser code on a single cURL request and it works (it returns an array containing the URL of every PDF on that page).

Here is my cURL code:

$urls = Array( 
 'http://www.example.com/about/1.htm', 
 'http://www.example.com/about/2.htm',
 'http://www.example.com/about/3.htm',
 'http://www.example.com/about/4.htm' 
); 

# Make DOMDoc
$dom = new DOMDocument();

foreach ($urls as $url) { 
    $ch = curl_init($url);  
    $html = curl_exec($ch);
    # Exec and close CURL, suppressing errors
    @$dom->createDocumentFragment($html);
    curl_close($ch);
}

Parser code:

#make pdf link array
$pdf_array = array();
# Iterate over all <a> tags and spit out those that end with ".pdf"
foreach($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    $linkh = $link->getAttribute('href');
    $filend = ".pdf";
    # @ at beginning suppresses string length warning
    @$pdftester = substr_compare($linkh, $filend, -4, 4, true);
    if ($pdftester === 0) {
        array_push($pdf_array, $linkh);
    }
}

The full code looks like this:

<?php 

$urls = Array( 
 'http://www.example.com/about/1.htm', 
 'http://www.example.com/about/2.htm',
 'http://www.example.com/about/3.htm',
 'http://www.example.com/about/4.htm' 
); 

# Make DOM parser
$dom = new DOMDocument();

foreach ($urls as $url) { 
    $ch = curl_init($url);  
    $html = curl_exec($ch);
    # Exec and close CURL, suppressing errors
    @$dom->createDocumentFragment($html);
    curl_close($ch);
} 

#make pdf link array
$pdf_array = array();
# Iterate over all <a> tags and spit out those that end with ".pdf"
foreach($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    $linkh = $link->getAttribute('href');
    $filend = ".pdf";
    # @ at beginning suppresses string length warning
    @$pdftester = substr_compare($linkh, $filend, -4, 4, true);
    if ($pdftester === 0) {
        array_push($pdf_array, $linkh);
    }
}

print_r($pdf_array);

?>

Any suggestions on what I am doing wrong with the DOM parsing and the PDF array building?

Answer 1

1. To get the HTML content into $html, you need to set the cURL option CURLOPT_RETURNTRANSFER. Otherwise curl_exec() just prints the content to the page and puts true (success) in $html.

CURLOPT_RETURNTRANSFER: TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it directly.

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);

2. The createDocumentFragment method does not do what you think it does.

This function creates a new instance of class DOMDocumentFragment. This node will not show up in the document unless it is inserted with (e.g.) DOMNode::appendChild().

So it does not read the HTML into the DOM document. It does not even take an $html argument.
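To make the difference concrete, here is a minimal sketch. The sample markup and the variable names are made up for illustration and are not from the question. It counts the <a> elements in the document after each call.

```php
<?php
// Hypothetical sample markup, used only for this illustration.
$html = '<html><body><a href="report.pdf">Report</a></body></html>';

$dom = new DOMDocument();

// Creates an empty fragment that is not attached to the document.
$fragment = $dom->createDocumentFragment();
$linksAfterFragment = $dom->getElementsByTagName('a')->length;

// loadHTML() actually parses the markup into the document.
$dom->loadHTML($html);
$linksAfterLoad = $dom->getElementsByTagName('a')->length;

echo "after createDocumentFragment(): $linksAfterFragment\n"; // 0
echo "after loadHTML(): $linksAfterLoad\n";                   // 1
```

The first count stays at zero because the document is still empty, which matches the empty array in the question.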

You are probably better off using the loadHTML method, or loadHTMLFile if you want to skip cURL and load the file directly into the DOM object in one go.

@$dom->loadHTML($html);    // Like this
@$dom->loadHTMLFile($url); // or this (removing the CURL lines)

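3. It makes sense to extract the PDF links right after loading each page's HTML into the DOM object, rather than trying to merge all the pages into one document before extracting. The parser code you have actually works fine for that :-)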
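Putting the three points together, one possible shape is sketched below. The function names (fetch_html, extract_pdf_links, collect_pdf_links) and the sample markup are hypothetical, not part of the original code. Each page gets its own DOMDocument because, as far as I know, every loadHTML() call replaces whatever the document held before. libxml_use_internal_errors() is used here in place of the @ operator to silence warnings about sloppy markup.

```php
<?php
// Fetch one page; returns null if the transfer fails.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

// Return the href of every <a> in $html whose target ends in ".pdf".
function extract_pdf_links(string $html): array
{
    $links = [];
    if ($html === '') {
        return $links; // nothing to parse
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // The length guard avoids the warning that the original code suppressed with @.
        if (strlen($href) >= 4 && substr_compare($href, '.pdf', -4, 4, true) === 0) {
            $links[] = $href;
        }
    }
    return $links;
}

// Extract links page by page and merge the results into one array.
function collect_pdf_links(array $urls): array
{
    $pdfs = [];
    foreach ($urls as $url) {
        $html = fetch_html($url);
        if ($html !== null) {
            $pdfs = array_merge($pdfs, extract_pdf_links($html));
        }
    }
    return $pdfs;
}

// Offline check with hypothetical markup (no network needed).
$sample = '<p><a href="/docs/a.pdf">A</a> <a href="/about.htm">B</a> <a href="/docs/B.PDF">C</a></p>';
print_r(extract_pdf_links($sample));
```

With the $urls array from the question, print_r(collect_pdf_links($urls)); would then print the PDF links from all pages. The offline check should list the two PDF hrefs and skip /about.htm.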

cURL multiple URLs & parse results
