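I am trying to scrape a website in order to build a list of the sources of all the iframes on it. To do this I am using the simple_html_dom PHP library.
I parse each link and search the linked page for iframes. If an iframe exists, I want the code to give me its src and the page it was found on.
What I am doing is:
Grab all the links on the desired page.
Put them into one big array, to avoid crashing the server.
What happens is that it works for about 20 rows, and then I get this error: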
Warning: file_get_contents(): php_network_getaddresses: getaddrinfo failed: No such host is known. in C:\xampp\htdocs\scrap\simple_html_dom.php on line 75
Warning: file_get_contents(http://www.achva.ac.ilhttp://www.achva.ac.il/לימודי-תעודה-והשתלמויות): failed to open stream: php_network_getaddresses: getaddrinfo failed: No such host is known. in C:\xampp\htdocs\scrap\simple_html_dom.php on line 75
Fatal error: Call to a member function find() on a non-object in C:\xampp\htdocs\scrap\scrap.php on line 61
For some reason I keep getting these errors.
Here is my code:
<!DOCTYPE html>
<html>
<head>
<title></title>
<style type="text/css">
th{
font-weight: 800;
border: 1px solid lightblue;
}
td{
border: 1px solid lightblue;
}
</style>
</head>
<body>
<?php
$html = file_get_contents('http://www.achva.ac.il/sitemap');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
$arr = [];
foreach ($links as $link){
array_push($arr, 'http://www.achva.ac.il'.$link->getAttribute('href'));
}
$result = count($arr);
echo $result;
?>
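<?php
// Returns true when the URL answers with a 200 status line (currently unused).
function urlOk($url) {
    $headers = @get_headers($url);
    // get_headers() returns false on failure, so check before indexing.
    return $headers !== false && strpos($headers[0], ' 200') !== false;
}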
include('simple_html_dom.php');
echo '<table><tr><th>id</th><th>Video src</th><th>Site page</th></tr>';
$i = 19;
$page_number = 1;
foreach($arr as $urlx){
echo $urlx;
$scrap_url = file_get_html($urlx);
if (preg_match('#^https?://(?:[^.]+\.)*achva\.ac\.il/#i', $urlx))
{
$div = $scrap_url->find('iframe');
if($div){
foreach ($div as $key) {
echo '<tr><td>' . $i . '</td>';
$src = $key->attr['src'];
echo '<td>' . $src . '</td>';
echo '<td>' .$urlx . '</td></tr>';
$page_number++;
}
}else{
echo '<tr><td>' . $i . '</td><td>no iframe in this page</td><td>' . $urlx . '</td></tr>';
}
$i++;
}
}
?>
</table>
</body>
</html>
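
A possible cause, judging from the second warning: the URL it reports is `http://www.achva.ac.ilhttp://www.achva.ac.il/...`, so the host was apparently prepended to an href that was already absolute. Below is a minimal sketch of how the hrefs could be normalized before they go into `$arr`. The helper name `resolveUrl` and the `/page` path are hypothetical (neither is part of simple_html_dom or of the script above), and the handling of relative paths is deliberately simplified: ports, `../` segments and query-only links are not covered.

```php
<?php
// Hypothetical helper (an assumption, not part of simple_html_dom): turn an
// href taken from a page into an absolute URL without prefixing the host twice.
function resolveUrl($base, $href) {
    $href = trim($href);

    // Skip empty links, in-page anchors and non-HTTP schemes.
    if ($href === '' || $href[0] === '#'
        || preg_match('#^(mailto|javascript|tel):#i', $href)) {
        return null;
    }

    // Already absolute: use it unchanged.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Protocol-relative link such as //host/path: reuse the base scheme.
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }

    // Root-relative link such as /sitemap.
    if ($href[0] === '/') {
        return $origin . $href;
    }

    // Anything else is treated as relative to the site root (a simplification).
    return $origin . '/' . $href;
}

// Example: only the root-relative link gets the host prepended.
$base = 'http://www.achva.ac.il';
foreach (array('/sitemap', 'http://www.achva.ac.il/page', '#top') as $href) {
    $url = resolveUrl($base, $href);
    echo ($url === null ? '(skipped)' : $url), "\n";
}
```

In the script this would replace the plain string concatenation inside the first `foreach`, adding a URL to `$arr` only when the helper returns a non-null value. Separately, the fatal error shows that `file_get_html()` did not return an object for the broken URL, so checking its result before calling `find()` should presumably keep a single bad link from stopping the whole run.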