I have written a web crawler in PHP and pointed it at eBay. It scrapes all the links on a given page, but it sometimes stores several URLs for the same link. That floods my database with duplicates, and I don't know how to adjust the code.
<?php
session_start();
$domain = "www.ebay.com";
if(empty($_SESSION['page']))
{
    $original_file = file_get_contents("http://" . $domain . "/");
    $_SESSION['i'] = 0;
    $connect = mysql_connect("xxxxxx", "xxxxxxxxxx", "xxxxxxxxxxxx");
    if (!$connect)
    {
        die("MySQL could not connect!");
    }
    $DB = mysql_select_db('xxxxxxxxxxxxx');
    if(!$DB)
    {
        die("MySQL could not select Database!");
    }
}
if(isset($_SESSION['page']))
{
    $connect = mysql_connect("xxxxxxxxxxxxx", "xxxxxxxxxxxxx", "xxxxxxxxxxxx");
    if (!$connect)
    {
        die("MySQL could not connect!");
    }
    $DB = mysql_select_db('xxxxxxxx');
    if(!$DB)
    {
        die("MySQL could not select Database!");
    }
    $PAGE = $_SESSION['page'];
    $original_file = file_get_contents("$PAGE");
}
$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
foreach($matches[1] as $key => $value)
{
    if(strpos($value,"http://") != 'FALSE' && strpos($value,"https://") != 'FALSE')
    {
        $New_URL = "http://" . $domain . $value;
    }
    else
    {
        $New_URL = $value;
    }
    $New_URL = addslashes($New_URL);
    $Check = mysql_query("SELECT * FROM pages WHERE url='$New_URL'");
    $Num = mysql_num_rows($Check);
    if($Num == 0)
    {
        mysql_query("INSERT INTO pages (url)
                     VALUES ('$New_URL')");
        $_SESSION['i']++;
        echo $_SESSION['i'] . "";
    }
    echo mysql_error();
}
$RandQuery = mysql_query("SELECT DISTINCT * FROM pages ORDER BY RAND() LIMIT 0,1");
$RandReturn = mysql_num_rows($RandQuery);
while($row1 = mysql_fetch_assoc($RandQuery))
{
    $_SESSION['page'] = $row1['url'];
}
echo $RandReturn;
echo $_SESSION['page'];
mysql_close();
?>
Answer 0 (score: 0):
First, there is a small problem with your link scraper: you are using
if(strpos($value,"http://") != 'FALSE' && strpos($value,"https://") != 'FALSE')
{
    $New_URL = "http://" . $domain . $value;
}
else
{
    $New_URL = $value;
}
after stripping all the tags.
The problem is that if a link's href looks like one of these:
<a href='#' ...> or <a href='javascript:func()'> or <a href='img...'> etc...
it will build an invalid URL that you don't want; you should use strpos() or preg_match() to catch these special cases (and a few others) and skip them.
You also need to skip URLs that point to files, for example: jpg, png, avi, wmv, zip, etc...
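One way to apply both filters the answer describes, skipping anchor/javascript hrefs and file links, is a small helper that checks the scheme and the path extension before the URL is queued. This is a sketch, not the answer's exact code; the extension blocklist is illustrative:

```php
<?php
// Decide whether a scraped href is worth storing (sketch; lists are illustrative).
function is_crawlable($href)
{
    if ($href === '' || $href[0] === '#') {
        return false; // empty href or in-page anchor
    }
    if (preg_match('/^(javascript|mailto|tel):/i', $href)) {
        return false; // non-HTTP scheme such as javascript:func()
    }
    // Read the extension from the path part only, ignoring query strings.
    $path  = (string) parse_url($href, PHP_URL_PATH);
    $ext   = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $files = array('jpg', 'png', 'gif', 'bmp', 'avi', 'wmv', 'mp3', 'mpeg', 'zip');
    return !in_array($ext, $files, true);
}

var_dump(is_crawlable('#'));                         // bool(false)
var_dump(is_crawlable('javascript:func()'));         // bool(false)
var_dump(is_crawlable('/img/logo.png'));             // bool(false)
var_dump(is_crawlable('http://www.ebay.com/deals')); // bool(true)
```

Calling this once per href keeps the scheme check and the file check in one place instead of two duplicated branches.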
Now to the question itself:
First save all the URLs from the target page into an array, then drop all duplicate values from that array; this minimizes the time your SQL queries will consume...
A quick test with www.ebay.com:
before cleaning duplicate URLs: 196.
after cleaning: 120.
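The in-memory deduplication step can be done with PHP's built-in array_unique(). A minimal sketch, with example URLs; stripping the "#fragment" part first makes same-page variants collapse into one entry:

```php
<?php
$urls = array(
    "http://www.ebay.com/deals",
    "http://www.ebay.com/deals",    // exact duplicate
    "http://www.ebay.com/deals#top" // same page, different fragment
);

// Strip fragments first so "#..." variants become identical strings.
$urls = array_map(function ($u) { return preg_replace('/#.*$/', '', $u); }, $urls);
// Drop duplicates and reindex the array.
$urls = array_values(array_unique($urls));

print_r($urls); // a single "http://www.ebay.com/deals" entry remains
```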
Then use:
SELECT EXISTS(SELECT 1 FROM table1 WHERE ...)
to check whether the URL already exists in your database... it is faster and more reliable.
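Note that an EXISTS query always returns exactly one row whose value is 0 or 1, so you must read that value rather than count rows. The mysql_* functions used here were removed in PHP 7; with PDO the check looks like this (a sketch using an in-memory SQLite database so it runs standalone; swap the DSN and credentials for your MySQL server):

```php
<?php
// Set up a throwaway database with one known URL.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE pages (url TEXT)");
$db->exec("INSERT INTO pages (url) VALUES ('http://www.ebay.com/deals')");

// EXISTS returns one row holding 0 or 1: read the value, don't count rows.
function url_exists(PDO $db, $url)
{
    $stmt = $db->prepare("SELECT EXISTS(SELECT 1 FROM pages WHERE url = ?)");
    $stmt->execute(array($url));
    return (bool) $stmt->fetchColumn();
}

var_dump(url_exists($db, 'http://www.ebay.com/deals')); // bool(true)
var_dump(url_exists($db, 'http://www.ebay.com/help'));  // bool(false)
```

The prepared statement also replaces the manual escaping, which closes the SQL-injection hole left by interpolating scraped URLs into query strings.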
Here is your code with my changes:
$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
$file_arr = array('#\.jpg$#i', '#\.png$#i', '#\.mpeg$#i', '#\.bmp$#i', '#\.gif$#i', '#\.wmv$#i', '#\.mp3$#i', '#\.avi$#i'); //add extensions here to skip file links.
$New_URL = array();
foreach($matches[1] as $key => $value)
{
    $value = preg_replace('/#.*/', '', $value); //remove in-page fragments.
    if($value == '' || strpos($value, "javascript") === 0)
    {
        continue; //skip empty hrefs and javascript: pseudo-links.
    }
    $avoid = 0; //reset for every link: does this URL point to a file? [0-no, 1-yes]
    foreach($file_arr as $val_reg)
    {
        if (preg_match($val_reg, $value)) { $avoid = 1; break; } //check all the file conditions
    }
    if ($avoid == 1)
    {
        continue;
    }
    $value = preg_replace('#/$#', '', $value); //strip any trailing '/' ...
    if (strpos($value, "http://") === 0 || strpos($value, "https://") === 0)
    {
        $New_URL[$key] = $value . "/"; //absolute URL: keep it, re-append exactly one '/'.
    }
    else
    {
        $New_URL[$key] = "http://" . $domain . $value . "/"; //relative URL: prepend the domain.
    }
}
//drop duplicates before touching the database:
$New_URL = array_unique($New_URL);
//check for duplicates before storing each URL:
foreach($New_URL as $check)
{
    $check = mysql_real_escape_string($check);
    $result = mysql_query("SELECT EXISTS (SELECT 1 FROM pages WHERE url='$check' LIMIT 1)"); //EXISTS returns one row holding 1 if the URL exists, 0 if not.
    $row = mysql_fetch_row($result);
    if ($row[0] == 0)
    {
        //Insert your query here......
    }
    else
    {
        //Don't store this URL......
    }
}
Not the cleanest code, but it should work...
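One further safeguard, not part of the answer above but directly on point for the duplicate problem: declare the url column UNIQUE so the database itself rejects duplicates even if two crawler runs race each other. A sketch using SQLite's INSERT OR IGNORE (MySQL's equivalent is INSERT IGNORE); the table layout mirrors the `pages` table used above:

```php
<?php
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE pages (url TEXT UNIQUE)"); // UNIQUE blocks duplicate rows at the database level

$stmt = $db->prepare("INSERT OR IGNORE INTO pages (url) VALUES (?)");
$stmt->execute(array('http://www.ebay.com/deals/'));
$stmt->execute(array('http://www.ebay.com/deals/')); // silently ignored, no error

$count = (int) $db->query("SELECT COUNT(*) FROM pages")->fetchColumn();
echo $count; // 1
```

With the constraint in place, the per-URL SELECT EXISTS round trip becomes an optimization rather than the only thing standing between the crawler and duplicate rows.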