Scraper inserts some strange characters into the database

Time: 2017-02-26 18:33:29

Tags: php sql simple-html-dom

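I am scraping some URLs from a web page. They display fine on the page, but when I insert a URL into the database it is stored with some strange characters, like this: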

http://westseattleblog.com/event/west-seattle-church-listings/?instance_id=567059

My code:

foreach($html->find('div[class=ai1ec-btn-group ai1ec-actions] a') as $element)
{
    $url= $element->href;
    $url1=mysql_real_escape_string($url);
    $sql="insert into catlink(catlink) values('$url1')";
    //echo $sql."<br>";
    $query=mysql_query($sql);
    //newpage
} 

When I fetch the URLs back from the database and scrape them one by one, nothing is displayed.

My code:

$sql1="select * from links limit 10";
$query1=mysql_query($sql1);
while($res=mysql_fetch_assoc($query1)){
    $url=$res['url'];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    // curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $page = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $html = $dom->load($page);
    foreach($html->find("div") as $a){
        echo $a->innertext;
    }
    //$separator = '&nbsp;-&nbsp;';
}

1 answer:

Answer 0: (score: 0)

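Your URL contains hex-encoded HTML entities. A browser decodes these when it renders the page, which is why the URL looks normal there, but the raw href value still contains them. You need to decode it with html_entity_decode, either before inserting it into the database or before passing it to cURL.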

So, either:

$url1=mysql_real_escape_string(html_entity_decode($url));

$url=html_entity_decode($res['url']);
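
To see what the decoding step does, here is a minimal sketch. The input string is a made-up example (an example.com URL with a few characters written as hex entities), not the actual value from the question, and the flags and charset passed to html_entity_decode are one reasonable choice rather than a requirement.

```php
<?php
// Hypothetical scraped href: "http" is written as hex entities and the
// query separator as &amp;. This is an illustrative value only.
$scraped = '&#x68;&#x74;&#x74;&#x70;://example.com/?a=1&amp;b=2';

// Decode named and numeric (decimal or hex) entities back to plain characters.
$decoded = html_entity_decode($scraped, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";
```

The decoded string is what should be stored in the database or handed to CURLOPT_URL; cURL does not interpret HTML entities itself.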