How can I get all links with preg_match_all?

Time: 2015-12-06 13:16:41

Tags: php regex web-scraping preg-match-all

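I am trying to use preg_match_all to match every image link on a page I am scraping that starts with http://i.ebayimg.com/ and ends with .jpg, but I can't get it right. I tried this, but it is not what I need: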

preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $contentas, $img_link);

I have the same problem with ordinary links. I don't know how to write a preg_match_all call for this:

<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&daysAfterCreation=7&isSearchRequest=true&withImage=true&scopeId=C&categories=Limousine&damageUnrepaired=NO_DAMAGE_UNREPAIRED&zipcode=&fuels=DIESEL&ambitCountry=DE&maxPrice=11000&minFirstRegistrationDate=2006-01-01&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=20&pageNumber=1" data-touch="hover" data-touch-wrapper=".cBox-body--resultitem">

Thanks a lot!

Update: I am trying this on http://suchen.mobile.de/fahrzeuge/search.html?isSearchRequest=true&scopeId=C&makeModelVariant1.makeId=1900&makeModelVariant1.modelId=10&makeModelVariant1.modelDescription=&makeModelVariantExclusions%5B0%5D.makeId=&categories=Limousine&minSeats=&maxSeats=&doorCount=&minFirstRegistrationDate=2006-01-01&maxFirstRegistrationDate=&minMileage=&maxMileage=&minPrice=&maxPrice=11000&minPowerAsArray=&maxPowerAsArray=&maxPowerAsArray=PS&minPowerAsArray=PS&fuels=DIESEL&minCubicCapacity=&maxCubicCapacity=&ambitCountry=DE&zipcode=&q=&climatisation=&airbag=&daysAfterCreation=7&withImage=true&adLimitation=&export=&vatable=&maxConsumptionCombined=&emissionClass=&emissionsSticker=&damageUnrepaired=NO_DAMAGE_UNREPAIRED&numberOfPreviousOwners=&minHu=&usedCarSeals= to get the car links, the image links, and all the other information. The other information comes through fine and my script runs well, but I have a problem grabbing the images and links. Here is my script:

<?php

        $id= $_GET['id'];
        $user= $_GET['user'];
        $login=$_COOKIE['login'];

    $query = mysql_query("SELECT pavadinimas,nuoroda,kuras,data,data_new from mobile where vartotojas='$user' and id='$id'");
    $rezultatas=mysql_fetch_row($query);

    $url = "$rezultatas[1]";

    $info = file_get_contents($url); 

function scrape_between($data, $start, $end){
$data = stristr($data, $start); 
$data = substr($data, strlen($start));
$stop = stripos($data, $end);
$data = substr($data, 0, $stop);
return $data;
  }
     // cut out the content block
    $turinys = scrape_between($info, '<div class="g-col-9">', '<footer class="footer">');
     // filtering: remove the paid "top" ads
    $contentas = preg_replace('/<div class="cBox-body cBox-body--topResultitem".*?>(.*?)<\/div>/', '' ,$turinys);
    // filtering done

      preg_match_all('/<span class="h3".*?>(.*?)<\/span>/',$contentas,$pavadinimas); 

      preg_match_all('/<span class="u-block u-pad-top-9 rbt-onlineSince".*?>(.*?)<\/span>/',$contentas,$data); 

      preg_match_all('/<span class="u-block u-pad-top-9".*?>(.*?)<\/span>/',$contentas,$miestas);

      preg_match_all('/<span class="h3 u-block".*?>(.*?)<\/span>/', $contentas, $kaina);

      preg_match_all('/<a[A-z0-9-_:="\.\/ ]+href="(http:\/\/suchen.mobile.de\/fahrzeuge\/[^"]*)"[A-z0-9-_:="\.\/ ]\s*>\s*<div/s', $contentas, $matches);

   print_r($pavadinimas);
   print_r($data);
   print_r($miestas);
   print_r($kaina);
   print_r($result);
   print_r($matches);

   ?>

2 answers:

Answer 0 (score: 1)

1. To capture the src attribute of every img tag whose URL starts with http://i.ebayimg.com/:

Regex: /src="(https?:\/\/i\.ebayimg\.com\/[^"]+?\.jpg)"/i

Here is an example:

$re = "/src=\"(https?:\\/\\/i\\.ebayimg\\.com\\/[^\"]+?\\.jpg)\"/i";
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
// $matches[1] holds the captured URLs

If you want to make sure the URL is captured inside an img tag, use this regex instead (note that it will be slower on very long pages):

$re = "/<img[^>]*?src=\"(https?:\\/\\/i\\.ebayimg\\.com\\/[^\"]+?\\.jpg)\"/i";
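
As a rough illustration of the img pattern, the sketch below runs a variant of it on a small, made-up HTML fragment. The sample markup, the "~" delimiter, and the variable name $imageUrls are assumptions for the example and are not part of the original answer.

```php
<?php
// Hypothetical sample markup; the real page source would come from file_get_contents().
$html = '<img class="a" src="http://i.ebayimg.com/00/s/abc/1.jpg">'
      . '<img src="http://example.com/logo.jpg">'
      . '<img src="https://i.ebayimg.com/00/s/def/2.JPG" alt="">';

// "~" as the delimiter avoids escaping every "/"; the dots are escaped so they match literally.
$re = '~<img[^>]*\ssrc="(https?://i\.ebayimg\.com/[^"]+\.jpg)"~i';

preg_match_all($re, $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured URLs.
$imageUrls = $matches[1];
print_r($imageUrls);
```

With the default PREG_PATTERN_ORDER flag, $matches[0] contains the full matches and $matches[1] the first capture group, which is why the URLs are read from index 1.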

2. To capture the href attribute of every a tag whose URL starts with http://suchen.mobile.de/fahrzeuge/:

Regex: /href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]+)"/i

Here is an example:

$re = "/href=\"(https?:\\/\\/suchen\\.mobile\\.de\\/fahrzeuge\\/[^\"]+)\"/i";
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
// $matches[1] holds the captured URLs

If you want to make sure the URL is captured inside an a tag, use this regex instead (note that it will be slower on very long pages):

$re = "/<a\\s[^>]*?href=\"(https?:\\/\\/suchen\\.mobile\\.de\\/fahrzeuge\\/[^\"]+)\"/i";
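
The link pattern can be tried the same way. This is only a sketch: the anchor below is modeled on the one in the question but shortened, the second anchor is invented to show a non-match, and the html_entity_decode step assumes the page encodes "&" as "&amp;" inside attributes.

```php
<?php
// Sample anchors: the first is a shortened version of the one in the question,
// the second is a made-up link that should not match.
$html = '<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&amp;pageNumber=1" data-touch="hover">'
      . '<a href="http://suchen.mobile.de/impressum.html">';

// Capture everything up to the closing quote; the listing URLs do not end in ".jpg".
$re = '~<a\s[^>]*?href="(https?://suchen\.mobile\.de/fahrzeuge/[^"]+)"~i';

preg_match_all($re, $html, $matches);

// Decode entities such as &amp; so the URLs can be requested directly.
$links = array_map('html_entity_decode', $matches[1]);
print_r($links);
```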

Answer 1 (score: 1)

DOMDocument is more convenient:

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($yourURL);

$imgNodes = $dom->getElementsByTagName('img');

$result = [];

foreach ($imgNodes as $imgNode) {
    $src = $imgNode->getAttribute('src');
    $urlElts = parse_url($src);
    // pathinfo() avoids passing the result of explode() by reference to array_pop()
    $ext = strtolower(pathinfo($urlElts['path'] ?? '', PATHINFO_EXTENSION));
    if ($ext == 'jpg' && ($urlElts['host'] ?? '') == 'i.ebayimg.com')
        $result[] = $src;
}

print_r($result);

To get the "normal" links, proceed the same way (DOMDocument + parse_url).
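
A minimal sketch of that approach for the car links might look like the following. It assumes the DOM extension is available; the sample markup, the /fahrzeuge/ path check, and the variable names are illustrative choices rather than part of the original answer.

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<html><body>'
      . '<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&amp;pageNumber=1">Car</a>'
      . '<a href="http://suchen.mobile.de/impressum.html">Imprint</a>'
      . '<a href="/relative/path">Relative</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep warnings about imperfect HTML quiet
$dom = new DOMDocument;
$dom->loadHTML($html);

$links = [];

foreach ($dom->getElementsByTagName('a') as $aNode) {
    $href = $aNode->getAttribute('href'); // entities are already decoded here
    $urlElts = parse_url($href);
    $host = $urlElts['host'] ?? '';
    $path = $urlElts['path'] ?? '';
    // Keep only pages under /fahrzeuge/ on suchen.mobile.de.
    if ($host === 'suchen.mobile.de' && strpos($path, '/fahrzeuge/') === 0) {
        $links[] = $href;
    }
}

print_r($links);
```

Because getAttribute returns the decoded attribute value, no separate entity decoding is needed before requesting the collected URLs.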