Question

我有一个脚本，我试图用simple_html_dom完成我可以刮掉我想要的网页，链接无效。我想让链接有效，所以我一直在尝试不同的东西，而不是让它工作。我可以让它刮，或修复以前保存的页面中的链接，但我似乎无法刮取链接，并修复链接，以便他们引用正确的域。

我可能会误用或误解如何使用simplehtmldom＆＃34; s＆＃34; save＆＃34;功能。

这就是我现在所拥有的：

<?php
include 'simple_html_dom.php';
$file1 = "http://www.indeed.com/jobs?q=Electrician&l=maine";
$file2 = "http://www.indeed.com/jobs?q=Electronic&l=maine";
$file3 = "http://www.indeed.com/jobs?q=Electronics+Tech&l=maine"; 
$file4 = "http://www.indeed.com/jobs?q=Helpdesk&l=maine";
$file5 = "http://www.indeed.com/jobs?q=Trades&l=maine";
$SEARCH = array($file1, $file2, $file3, $file4, $file5);

//Fix links
$domain = "http://www.indeed.com";
$rep['/href="(?!https?:\/\/)(?!data:)(?!#)/'] = 'href="'.$domain;
$rep['/src="(?!https?:\/\/)(?!data:)(?!#)/'] = 'src="'.$domain;
$rep['/@import[\n+\s+]"\//'] = '@import "'.$domain;
$rep['/@import[\n+\s+]"\./'] = '@import "'.$domain;

//Find this: data-tn-component="organicJob"
//<div class="  row  result" id="p_a8a968e2788dad48" data-jk="a8a968e2788dad48" itemscope itemtype="http://schema.org/JobPosting" data-tn-component="organicJob">

$html = new simple_html_dom();

for ($i = 0; $i<6; $i++)
{
    $html->load_file($SEARCH[$i]);

    foreach($html->find('div[data-tn-component="organicJob"]') as $div)
    {
      $str = $html->save($div);
      $output = preg_replace(array_keys($rep), array_values($rep), $str);
       echo  $output->innertext . "\n";
    }
}

?>

如何抓取页面，并修复指向正确域名的链接？

使用simplehtmldom

0 个答案: