使用div的PHP web抓取

时间:2014-09-06 11:05:00

标签: php html web-scraping

我已尝试过所有内容,但我已阅读其他问题,但它无法正常工作。

我想从这个网站:

http://www.interparcel.com/tracking.php?action=dotrack&trackno=RE367831140GR

提取这个:

  

抱歉,没有找到这些详细信息的寄售。错误 - 未收到xml数据

我也尝试过网站parcelforce.comdhl.com:相同的程序,结果是零匹配。

我尝试过的事情(很多):

$curl = curl_init('http://www.interparcel.com/tracking.php?action=dotrack&trackno=$nummm');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);

$page = curl_exec($curl);

if(curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}

curl_close($curl);

$regex = '/<div class="header-description">(.*?)</div>/s';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else 
    print "Not found"; 

<?php // File: MatchAllDivMain.php

// Read html file to be processed into $data variable
$data = file_get_contents('test.html');

// Commented regex to extract contents from <div class="main">contents</div>
//  where "contents" may contain nested <div>s.
//  Regex uses PCRE's recursive (?1) sub expression syntax to recurs group 1
$pattern_long = '{           # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*>              # match the "main" class DIV opening tag
  (                                   # capture "main" DIV contents into $1
    (?:                               # non-cap group for nesting * quantifier
      (?: (?!<div[^>]*>|</div>). )++  # possessively match all non-DIV tag chars
    |                                 # or 
      <div[^>]*>(?1)</div>            # recursively match nested <div>xyz</div>
    )*                                # loop however deep as necessary
  )                                   # end group 1 capture
</div>                               # match the "main" class DIV closing tag
}six';  // single-line (dot matches all), ignore case and free spacing modes ON

// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';

$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
    echo("$matchcount matches found.\n");
    //  print_r($matches);
    for($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i]); // print 1st capture group for match number i
    }
} else {
    echo('No matches');
}
echo("\n</pre>");
?>

以下描述的方法:

一切都没有成功,对我做错了什么有帮助?

0 个答案:

没有答案