Question

I'll try to be clearer in the text than in the title.

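I have built a PHP page that scrapes another website and stores the results in an array rather than in a database (the scrape is repeated 155 times; these calls are driven by another array).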

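To get faster results, I implemented another PHP page that uses fopen() to call the "scraping page" several times (about 5), splitting the original array into 5 parts.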
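A minimal sketch of that split, assuming the 155 entries live in a plain array. The name $countryList is borrowed from the function signature further down, and its contents here are placeholder IDs, not the real data:

```php
<?php
// Hypothetical stand-in for the original array of 155 entries.
$countryList = range(1, 155);

// Split into 5 parts; ceil() keeps every entry even when the count
// is not evenly divisible by 5.
$parts = array_chunk($countryList, (int) ceil(count($countryList) / 5));

foreach ($parts as $i => $part) {
    // Each part would be handed to one call of the scraping page.
    echo "part $i: " . count($part) . " entries\n";
}
```

With 155 entries this gives 5 parts of 31 entries each.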

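Every time I call the scraping page, it iterates again and again (155 times overall). But when I use fopen(), it (sometimes) starts returning this error:

 Fatal error: Call to a member function getElementsByTagName() on a non-object

So I guess it must be a "multi-processing" matter: if I trigger the scraper too many times, it gives me the error.

So I tried calling the "scraping page" 2 or 3 times, then giving the script a rest (sleep(1)), and then making the other 2 or 3 calls to the scraping page. In that case, sometimes all the scripts work perfectly, and sometimes I get the same error again.

Here is part of my code, from the SCRAPING PAGE (the scraping script). The error always relates to this part of the code: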

function taxExtract($countryList,$urlTax,$countryID,$countryName,$countryTag) {

    echo $urlTax;

    $optionsTax = Array(
        CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 300,   // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 300,  // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
        CURLOPT_URL => $urlTax, // Setting cURL's URL option with the $url variable passed into the function
    );

    $TaxCurl = curl_init($urlTax);
    curl_setopt_array($TaxCurl, $optionsTax);   // Setting cURL's options using the previously assigned array data in $options
    $resultTaxCurl = curl_exec($TaxCurl);
    $htmlTax = $resultTaxCurl;

    $domTax = new DOMDocument();
    $htmlTax = $domTax->loadHTML($htmlTax);

    $domTax->preserveWhiteSpace = false;

    $taxFullArr = array();
    $taxFullArr[] = array (
        'countryID' => $countryID,
        'countryName' => $countryName,
        'countryTag' => $countryTag);

    $alltaxtables = $domTax->getElementsByTagName('table');

    if($alltaxtables->length > 1) { // GET ONLY THE FIRST TABLE IF THERE ARE MORE THAN 1
        $taxtable = $alltaxtables->item(2);
    }

    $taxrows = $taxtable->getElementsByTagName("tr");
    foreach($taxrows as $taxrow) {
        $taxcols = $taxrow->getElementsByTagName('td');
        if (($taxcols->item(0)->nodeValue != "Resource") and ($taxcols->item(1)->nodeValue != "VAT") and ($taxcols->item(2)->nodeValue != "Import Tax") and ($taxcols->item(3)->nodeValue != "Income Tax")) {
            echo "this is Country ID: ".$countryID." - ";
            echo "this is Country Name: ".$countryName." - ";
            echo "this is Country Tag: ".$countryTag." - ";
            echo "this is Resource: ".$taxRes = $taxcols->item(0)->nodeValue." - ";
            echo "this is Vat tax: ".trim($taxIva = $taxcols->item(1)->nodeValue)." - ";
            echo "this is Import tax: ".trim($taxImport = $taxcols->item(2)->nodeValue)." - ";
            echo "this is Work tax: ".trim($taxWork =  $taxcols->item(3)->nodeValue)." - ";

            $taxList[] = array (
                'taxRes' => $taxRes,
                'taxVat' => $taxIva,
                'taxImport' => $taxImport,
                'taxIncome' => $taxWork
            );
        }
    }
    $taxFullArr[] = $taxList;
}

From the MULTI-PROCESS PAGE (the multi-process script):

 $taxrows = $taxtable->getElementsByTagName("tr");

Do you have any idea why this happens? Do you know how I can fix it?

Feel free to ask for further explanation, and sorry if I haven't been clear enough.

Alberto

Answer 1

if (!$taxtable) throw new SomeException();

Put your scraping logic in a function, then run that function inside a try block and check for errors that way.
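As a rough sketch of that idea (not the asker's actual code): the helper findTable() and the exception class TableNotFoundException are hypothetical names introduced here, and the HTML string stands in for whatever curl_exec() returned. The point is only that a missing table becomes a catchable exception rather than a fatal error:

```php
<?php
// Hypothetical exception type, used where the answer writes SomeException.
class TableNotFoundException extends Exception {}

/**
 * Returns the table at $index, or throws instead of handing back null.
 * $html is whatever curl_exec() returned: a string, or false on failure.
 */
function findTable($html, int $index): DOMElement
{
    if (!is_string($html) || trim($html) === '') {
        throw new TableNotFoundException('Empty response: the request failed or returned no body');
    }

    $dom = new DOMDocument();
    // Scraped HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $table = $dom->getElementsByTagName('table')->item($index);
    if (!$table instanceof DOMElement) {
        throw new TableNotFoundException("No table at index $index");
    }
    return $table;
}

try {
    // A page without any table, e.g. an error page served under load.
    $table = findTable('<html><body><p>Service unavailable</p></body></html>', 2);
    echo $table->getElementsByTagName('tr')->length . " rows\n";
} catch (TableNotFoundException $e) {
    // Log and skip (or retry) this URL instead of letting the script die.
    echo 'Skipped: ' . $e->getMessage() . "\n";
}
```

Whether to skip, retry, or abort in the catch block depends on how the calling page uses the results.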

I can't really help you scrape a web page whose data I don't know. Can you provide sample data from the curl request and a sample table?
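One possible way to collect that sample data is sketched below. The helper summarizeResponse() is a hypothetical name, and the input string is a placeholder for a real response. It reports how many tables a response contains, which matters here because the question's code reads item(2), that is, the third table:

```php
<?php
// Summarise a fetched page so a failing response can be inspected or shared.
// $html is whatever curl_exec() returned: a string, or false on failure.
function summarizeResponse($html): array
{
    if (!is_string($html) || trim($html) === '') {
        return ['bytes' => 0, 'tables' => 0, 'preview' => ''];
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'bytes'   => strlen($html),
        'tables'  => $dom->getElementsByTagName('table')->length,
        'preview' => substr($html, 0, 200),
    ];
}

print_r(summarizeResponse('<html><body><table><tr><td>x</td></tr></table></body></html>'));
```

Logging this summary for the URLs that fail would show whether the site returned an empty body, an error page, or a page with fewer tables than expected.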

PHP scraping with DOM and a multi-process fopen() function returns an error while parsing the HTML (Call to a member function getElementsByTagName() on a non-object)
