I am referring to this link about extracting URLs that contain a specific word from a web page:
regex to print url from any webpage with specific word in url
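However, a few URLs, such as Pinterest and Facebook referral URLs, also contain the word I am interested in. I do not want those Facebook and Pinterest URLs because they are not direct URLs, so I want to exclude them. I noticed that such URLs contain at least two occurrences of http.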
Something like this
So I want to exclude URLs that contain at least two occurrences of http.

Answer 0 (score: 0)
You could try something like this to avoid those URIs:
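$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $href = $node->getAttribute('href');
    // Skip hrefs that start with a scheme and contain a second http/https later on
    if (!preg_match('~^https?://.+?https?\b~i', $href)) {
        echo "$href\n";
    }
}
Here preg_match('~^https?://.+?https?\b~i', $href) should match the to-be-excluded URIs: it matches any href that begins with http:// or https:// and contains a second http or https further on.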
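To see how such a pattern behaves, here is a minimal sketch that runs it against a few sample links. The URLs are hypothetical placeholders, not taken from the question, and the pattern is a variant that accepts either http:// or https:// at the start, which is an assumption about what the wrapper links look like.

```php
<?php
// Hypothetical sample hrefs for illustration only.
$samples = [
    'http://example.com/blog/post-1',
    'https://example.com/blog/post-2',
    'https://www.facebook.com/sharer.php?u=http://example.com/blog/post-1',
    'http://pinterest.com/pin/create/button/?url=https%3A%2F%2Fexample.com%2Fblog',
];

// A scheme at the start, followed later by a second "http" or "https",
// is treated as a wrapper (referral) URL.
$wrapperPattern = '~^https?://.+?https?\b~i';

$direct = [];
foreach ($samples as $href) {
    if (!preg_match($wrapperPattern, $href)) {
        $direct[] = $href; // no second scheme found, keep it
    }
}

print_r($direct);
```

With these samples only the two example.com links are kept. The \b at the end means a path segment such as /httpd is not mistaken for a second scheme.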

Answer 1 (score: 0)
I would probably loop over the links and drop the ones with a double http, for example:
$request_url = 'YOUR URL';

// Fetch the page
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
if ($result === false) {
    die('cURL error: ' . curl_error($ch) . "\n");
}

// Parse the HTML, suppressing warnings about malformed markup
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($result); // loads your html
$xpath = new DOMXPath($doc);

// Select every link whose href contains the needle
$needle = 'blog';
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");

$validUrls = array();
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $curUrl = $node->getAttribute('href');
    // Keep only URLs in which "http" occurs exactly once (case-insensitive)
    if (substr_count(strtolower($curUrl), 'http') === 1) {
        $validUrls[] = $curUrl;
    }
}
var_dump($validUrls); // all URLs with only one "http"
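As an alternative, the double-http check can be pushed into the XPath expression itself, so no filtering is needed in the loop. This is only a sketch under stated assumptions: the HTML is a hypothetical stand-in for the fetched page, and it relies on the XPath 1.0 functions contains() and substring-after(), which are case-sensitive.

```php
<?php
// Hypothetical HTML standing in for the downloaded page.
$html = <<<'EOT'
<html><body>
<a href="http://example.com/blog/a">direct</a>
<a href="https://www.facebook.com/sharer.php?u=http://example.com/blog/a">share</a>
<a href="/blog/relative">relative</a>
<a href="http://example.com/about">no needle</a>
</body></html>
EOT;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$needle = 'blog';

// substring-after(@href, 'http') is whatever follows the first "http";
// if that remainder contains "http" again, the link is a wrapper and is skipped.
$query = "//a[contains(@href, '$needle')"
       . " and not(contains(substring-after(@href, 'http'), 'http'))]";

$urls = [];
foreach ($xpath->query($query) as $node) {
    $urls[] = $node->getAttribute('href');
}

print_r($urls);
```

Unlike the substr_count(...) === 1 check, this version also keeps relative links such as /blog/relative, because they contain no http at all. Whether that is desirable depends on the page being scraped.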