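I wrote a very basic web scraper in PHP using cURL and DOM. I run it locally on a Windows 10 machine using XAMPP (Apache and MySQL). It scrapes about 5 values from each of 400 pages (about 2,000 values in total) on one specific website. The job usually finishes in < 120 seconds, but intermittently (roughly 1 in every 5 runs) it stops at around the 60-second mark with the following error:
Recv failure: Connection was reset
Probably irrelevant, but all of my scraped data is written to a MySQL table, and a separate .php file styles the data and presents it. That part works fine. The error is thrown by cURL. Here is my (heavily trimmed) code: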
$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');
//Some code that creates my SQL table.
//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){
//3 array variables defined here, which I didn't include in this example.
$path = $uniqueItems->href;
$url = 'http://IPAddressOfSiteIAmScraping' . $path;
//Here's the cURL part - I suspect this is the problem. I am an amateur!
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
//This is the part that throws up the connection reset error.
if(curl_errno($curl)) {
echo 'Scraping error: ' . curl_error($curl);
exit; }
curl_close($curl);
//Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
$dom = new DOMDocument;
$dom->encoding = 'utf-8'; //Allows the DOM to display html entities for special characters like ö.
@$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
$xpath = new DOMXpath($dom); //Allows us to use Xpath values.
//Xpaths that I've set - this is for the SQL part. Probably irrelevant.
$header = $xpath->query('(//div[@id="wrapper"]//p)[@class="header"][1]');
$price = $xpath->query('//tr[@class="price_tr"]/td[2]');
$currency = $xpath->query('//tr[@class="price_tr"]/td[3]');
$league = $xpath->query('//td[@class="left-column"]/p[1]');
//Here we collect specifically the item name from the DOM.
foreach($header as $e) {
$temp = new DOMDocument();
$temp->appendChild($temp->importNode($e,TRUE));
$val = $temp->saveHTML();
$val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
$val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
$val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
$final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
$item['title'] = $final; //Here's the item name, ready for the SQL table.
}
//Here's a bunch of code where I write to my SQL table. Again, this part works great!
}
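I'm not opposed to switching to regex if I need to ditch DOM, but I did lurk for three whole days before choosing DOM over regex. I've spent a lot of time researching this problem, but everything I've found talks about "Recv failure: Connection reset by peer", which is not the error I'm getting. I'm really frustrated that I have to ask for help, since I'd been doing so well up to now, just learning as I go. This is the first thing I've ever written in PHP.
TL;DR: I wrote a cURL web scraper that works great only 80% of the time. The other 20% of the time, for an unknown reason, it fails with the error "Recv failure: Connection was reset".
Hoping someone can help me!! :) Thanks for reading, even if you can't!
P.S. If you'd like to see my full code, it's at: http://pastebin.com/vf4s0d5L.
Answer 0 (score: 0)
After a long time researching (I had already been at it for days before posting the question), I've caved and accepted that this error is probably related to the site I'm trying to scrape, and is therefore out of my control.
I did manage to work around it, though, so I'll drop my workaround here...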
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
if(curl_errno($curl)) {
echo 'Scraping error: ' . curl_error($curl) . '<br>';
echo 'Dropping table...<br>';
$sql = "DROP TABLE table_item_info";
if (!mysqli_query($conn, $sql)) {
echo "Could not drop table: " . mysqli_error($conn);
}
mysqli_close($conn);
echo "TABLE has been dropped. Restarting.<br>";
goto start; //Jumps back to the start: label at the top of the script.
}
curl_close($curl);
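Basically, what I did was implement error checking. If curl_errno($curl) reports an error, I assume it's the connection reset error. In that case, I drop my SQL table and then use "goto start" to jump back to the beginning of my script. At the top of my file, I have the label "start:".
This solved my problem! Now I don't need to worry about whether the connection gets reset. My code detects the failure on its own and restarts the script when that happens.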
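A less drastic variant of the same idea would be to retry only the page that failed, instead of dropping the table and restarting the whole run. The following is a minimal sketch, not part of the workaround above: `fetchWithRetry` and `curlFetch` are hypothetical helper names, the attempt count, delay and timeouts are arbitrary assumptions, and it treats only cURL error 56 (`CURLE_RECV_ERROR`, the code behind "Recv failure") as worth retrying.

```php
<?php
// Hypothetical helper: calls $fetch until it succeeds or $maxAttempts is reached.
// $fetch must return [string $body, int $errno, string $error].
function fetchWithRetry(callable $fetch, int $maxAttempts = 3, int $delayMs = 500,
                        array $retryable = [56]): string // 56 = CURLE_RECV_ERROR
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        [$body, $errno, $error] = $fetch();
        if ($errno === 0) {
            return $body; // Success: hand the page back to the caller.
        }
        if (!in_array($errno, $retryable, true)) {
            // Any other error is not assumed to be a connection reset.
            throw new RuntimeException("cURL error $errno: $error");
        }
        if ($attempt < $maxAttempts) {
            usleep($delayMs * 1000 * $attempt); // Wait a little longer each time.
        }
    }
    throw new RuntimeException("Giving up after $maxAttempts attempts: $error");
}

// Hypothetical wrapper around the cURL calls from the workaround (timeouts are assumptions).
function curlFetch(string $url): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, trim($url));
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($curl); // false on failure, the page as a string on success.
    return [$body === false ? '' : $body, curl_errno($curl), curl_error($curl)];
}
```

Inside the foreach loop, `$page = fetchWithRetry(function () use ($url) { return curlFetch($url); });` could then stand in for the curl_* calls. Because the fetch is passed in as a callable, the retry logic can be exercised without touching the network. Whether a few retries are enough depends on why the remote site resets the connection, which I still don't know.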
Hope this helps!