Exclude URLs that contain a double http

Date: 2013-12-09 16:32:15

Tags: php regex xpath

I am following this question to extract URLs that contain a specific word from a web page:

regex to print url from any webpage with specific word in url

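However, a few URLs, such as Pinterest and Facebook referral URLs, also contain the word I am interested in. I don't want the Facebook and Pinterest URLs because they are not direct URLs, so I want to exclude them. I noticed that these URLs always contain at least two occurrences of http.

Something like this: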

http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2Fpicture-gallery%2Fsensual-in-saree-gallery%2Fspecials%2F3774%2F7%2Findex.htm&media=http%3A%2F%2Fmedia.glamsham.com%2Fdownload%2Fpicturegallery%2Ffeatured%2Fbollywood-beauties-saree%2F722-sensual-in-saree.jpg&guid=gNh5ehWodCZW-0&description=Rani%20Mukerji%20in%20saree%20at%20Sensual%20in%20saree%20picture%20gallery%20picture%20%23%207%20%3A%20glamsham.com

So I want to exclude URLs that contain at least two occurrences of http.

2 answers:

Answer 0: (score: 0)

You can try something like this to avoid those URIs:

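$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $href = $node->getAttribute('href');
    // skip hrefs that start with http(s):// and contain a second http/https later on
    if (!preg_match('~^https?://.+?https?\b~i', $href)) {
        echo "$href\n";
    }
}

preg_match('~^https?://.+?https?\b~i', $href) should match the URIs that are to be excluded. The leading https? also covers referral links that are themselves served over https.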
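
To see how a pattern of this kind behaves, here is a minimal sketch that runs it against a few sample URLs. The helper name hasNestedHttp and the sample URLs are hypothetical, and the pattern shown accepts both http:// and https:// at the start, which is an assumption about the links you will encounter.

```php
<?php
// Hypothetical helper: returns true when a URL starts with http(s)://
// and contains another "http" or "https" token later in the string.
function hasNestedHttp(string $href): bool
{
    return preg_match('~^https?://.+?https?\b~i', $href) === 1;
}

// Sample URLs for illustration only.
$samples = [
    'http://www.glamsham.com/picture-gallery/index.htm',
    'http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2F',
    'https://www.facebook.com/sharer.php?u=https%3A%2F%2Fexample.com%2Fblog',
];

foreach ($samples as $href) {
    // "skip" marks a referral-style URL, "keep" marks a direct one.
    echo (hasNestedHttp($href) ? 'skip ' : 'keep ') . $href . "\n";
}
```

Note that the second http in a referral link is usually percent-encoded as http%3A%2F%2F, so the pattern looks for the bare word http rather than for a second http://.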

Answer 1: (score: 0)

I would probably loop over them and drop the ones that contain a double http, for example:

$request_url = 'YOUR URL';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$result = curl_exec($ch);
if ($result === false) {
    die("Could not fetch $request_url\n");
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$doc->loadHTML($result); // loads your html
$xpath = new DOMXPath($doc);
$needle = 'blog';

// select every link whose href contains the needle
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
$validUrls = array();
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $curUrl = $node->getAttribute('href');
    // keep the URL only if "http" occurs exactly once
    if (substr_count($curUrl, 'http') === 1) {
        $validUrls[] = $curUrl;
    }
}

var_dump($validUrls); // all urls with only one "http"
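
The counting check can also be pulled out into a small function so it can be tried without fetching a page. This is only a sketch under a few assumptions: the helper name filterDirectUrls and the sample URLs are hypothetical, the count is made case-insensitive by lower-casing first, and the test is relaxed to "at most one" so that relative links with no http at all are kept, which may or may not be what you want.

```php
<?php
// Hypothetical helper: keep only URLs in which "http" occurs at most once.
// Lower-casing first makes the count case-insensitive ("HTTP://..." counts too).
// Using <= 1 instead of === 1 also keeps relative links such as "/blog/post-2";
// that is an assumption, not part of the answer above.
function filterDirectUrls(array $urls): array
{
    return array_values(array_filter($urls, function (string $url): bool {
        return substr_count(strtolower($url), 'http') <= 1;
    }));
}

// Sample URLs for illustration only.
$urls = [
    'http://example.com/blog/post-1',
    '/blog/post-2',
    'http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fexample.com%2Fblog',
];

print_r(filterDirectUrls($urls));
```

Because "https" contains "http" exactly once, a plain https:// link still counts as a single occurrence and is kept.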