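How do I build a regular expression to extract the contents of a <table>?
I want to scrape a website, but I need the second table on the page, not the first. This is what I am doing: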
preg_match('/<table[^>]+cellspacing="0"[^>]*>(.*?)<\/table>/is', $returnCurl, $features);
The HTML is here.
I only want the "features" table.
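Answer 0 (score: 2)
This was accepted prematurely, I think. If you want to do it with DOMDocument, here is a generic DOM scraping class I built a while ago. It is very basic. If you want more features, there is also Simple HTML DOM, but the bottom line is: don't use regular expressions to parse HTML!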
<?php
$site = 'http://www.grossiste-informatique.com/grossiste/detail_article_popup.php?code_article=POA/F200CA-KX019H';
$scraper = new DOMScraper();
//Set site and get source
$scraper->setSite($site)
->setSource();
echo '<table cellspacing="0" cellpadding="3" border="0" width="100%">',
//match and return only tables inner content with cellpadding="3"
$scraper->getInnerHTML('table', 'cellpadding=3'),
'</table>';
/**
* Generic DOM scraper using DOMDocument and cURL
*/
Class DOMScraper extends DOMDocument{
public $site;
private $source;
private $dom;
function __construct(){
// Initialise the underlying DOMDocument before touching its properties
parent::__construct();
libxml_use_internal_errors(true);
$this->preserveWhiteSpace = false;
$this->strictErrorChecking = false;
$this->formatOutput = true;
}
function setSite($site){
$this->site = $site;
return $this;
}
function setSource(){
if(empty($this->site))return 'Error: Missing $this->site, use setSite() first';
$this->source = $this->get_data($this->site);
return $this;
}
function getInnerHTML($tag, $id=null, $nodeValue = false){
if(empty($this->source))return 'Error: Missing $this->source, use setSource() first';
$this->loadHTML($this->source);
$tmp = $this->getElementsByTagName($tag);
$ret = null;
foreach ($tmp as $v){
if($id !== null){
$attr = explode('=',$id);
if($v->getAttribute($attr[0])==$attr[1]){
if($nodeValue == true){
$ret .= trim($v->nodeValue);
}else{
$ret .= $this->innerHTML($v);
}
}
}else{
if($nodeValue == true){
$ret .= trim($v->nodeValue);
}else{
$ret .= $this->innerHTML($v);
}
}
}
return $ret;
}
function innerHTML($dom){
$ret = "";
$nodes = $dom->childNodes;
foreach($nodes as $v){
$tmp = new DOMDocument();
$tmp->appendChild($tmp->importNode($v, true));
$ret .= trim($tmp->saveHTML());
}
return $ret;
}
function get_data($url){
if(function_exists('curl_init')){
$ch = curl_init();
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}else{
return file_get_contents($url);
}
}
}
?>
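For a one-off extraction, the attribute match that the class performs can also be written more compactly with DOMXPath. The following is only a minimal sketch, not part of the class above: the sample markup in $html is hypothetical and stands in for the fetched page, and the cellpadding="3" predicate is taken from the usage example rather than from the real page.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page source.
$html = <<<HTML
<html><body>
<table cellpadding="0"><tr><td>Header</td></tr></table>
<table cellpadding="3"><tr><td>Feature A</td></tr></table>
</body></html>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the first table whose cellpadding attribute equals "3".
$xpath = new DOMXPath($doc);
$table = $xpath->query('//table[@cellpadding="3"]')->item(0);

// Serialise each child node to build the table's inner HTML.
$inner = '';
if ($table !== null) {
    foreach ($table->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner;
```

Passing a node to saveHTML() avoids creating a temporary DOMDocument per child, which is what innerHTML() in the class does. If the attribute is not a reliable marker on the real page, the XPath expression (//table)[2] selects the second table by position instead.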
Answer 1 (score: -3)
I'll be the first to link you to the relevant post
Use DOMDocument instead.
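As a rough illustration of that suggestion, picking the second table by position could look like the sketch below. The markup in $html is a hypothetical stand-in for $returnCurl from the question, and the assumption that the features sit in plain td cells is mine, not something the original page confirms.

```php
<?php
// Hypothetical sample markup standing in for $returnCurl from the question.
$html = <<<HTML
<html><body>
<table id="first"><tr><td>Header</td></tr></table>
<table cellspacing="0"><tr><td>Feature A</td></tr><tr><td>Feature B</td></tr></table>
</body></html>
HTML;

// Suppress warnings about malformed markup and collect them internally instead.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// item() is zero-based, so index 1 is the second <table> in document order.
$secondTable = $doc->getElementsByTagName('table')->item(1);

// Collect the trimmed text of every cell in that table.
$features = [];
if ($secondTable !== null) {
    foreach ($secondTable->getElementsByTagName('td') as $cell) {
        $features[] = trim($cell->textContent);
    }
}

print_r($features);
```

Note that getElementsByTagName() also counts nested tables in document order, so on a page with tables inside tables the index of the one you want may not be 1.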
Also, if you really want to (and you really don't want to), you can try this regular expression (untested):
preg_match('/<table[^>]+>.*?<table[^>]+>(.*?)<\/table>/is', $returnCurl, $features);