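I have written various regular expressions to scrape data.
I can already scrape images from the page source.
Here is how I scrape data from table td cells: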
<?php
$s = file_get_contents('http://www.altassets.net/altassets-events');
$matches = array();
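// Capture the contents of every <tr>...</tr>; "s" lets "." match newlines, "U" makes the match ungreedy
preg_match_all("/<tr>(.*)<\/tr>/sU", $s, $matches);
$trs = $matches[1];
$td_matches = array();
foreach ($trs as $tr) {
    $tdmatch = array();
    // Capture the contents of every <td>...</td> in this row
    preg_match_all("/<td>(.*)<\/td>/sU", $tr, $tdmatch);
    $td_matches[] = $tdmatch[1];
}
var_dump($td_matches);
//print_r($td_matches);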
?>
The same approach works for images and titles.
But how do I scrape data from a <p> tag that has a specific class name?
<p class="review_comment ieSucks" itemprop="description" lang="en"> Some text </p>
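Consider this page:
http://www.yelp.com/biz/fontanas-italian-restaurant-cupertino
This is only an example; I just want to know the procedure. The class name and tag name can change.
I want to scrape the reviews and their rating values from the page.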
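Answer 0 (score: 0)
You can use the Simple HTML DOM parser (a third-party library; include its simple_html_dom.php file first).
Usage is very simple: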
// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');
Then you can do this:
// Find all elements with id=foo
$ret = $html->find('#foo');
// Find all elements with class=foo
$ret = $html->find('.foo');
Answer 1 (score: 0)
Don't use regular expressions. Use PHP's native DOMXPath or DOMDocument classes instead.
foreach ($dom->getElementsByTagName('p') as $ptag)
{
    if ($ptag->getAttribute('class') == "review_comment ieSucks")
    {
        echo $ptag->nodeValue; // prints " Some text "
    }
}
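This loops through all the paragraph tags and checks whether the class attribute matches; if it does, it simply prints the node's value. A complete version: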
<?php
libxml_use_internal_errors(true);
$html = file_get_contents('http://www.yelp.com/biz/fontanas-italian-restaurant-cupertino');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('p') as $ptag)
{
    if ($ptag->getAttribute('class') == "review_comment ieSucks")
    {
        echo "<h6>".$ptag->nodeValue."</h6>";
    }
}
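The exact string comparison above only matches when the class attribute is exactly "review_comment ieSucks"; it breaks if the classes are reordered or one is added. As a minimal sketch, assuming you only care about one class token (review_comment here), an XPath query can match that token regardless of the other classes. The helper name find_by_class and the inline sample HTML are made up for illustration, and the class name is assumed to contain no quote characters:

```php
<?php
// Hypothetical helper: returns the trimmed text of every <$tag> element
// whose class attribute contains $class as a whole, space-separated token.
function find_by_class(string $html, string $tag, string $class): array
{
    libxml_use_internal_errors(true);   // tolerate invalid real-world HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Pad the class list with spaces so "review_comment" does not
    // also match a longer name such as "review_comment_old".
    $query = sprintf(
        "//%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $tag,
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Made-up sample markup, modeled on the <p> tag from the question.
$sample = '<html><body>'
    . '<p class="review_comment ieSucks" lang="en"> Some text </p>'
    . '<p class="ieSucks review_comment"> Other text </p>'
    . '<p class="review_comment_old">Ignored</p>'
    . '</body></html>';

print_r(find_by_class($sample, 'p', 'review_comment'));
```

For a live page you would pass the string returned by file_get_contents() instead of $sample, after checking that the download succeeded.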
Answer 2 (score: 0)
Here is a complete example of data scraping plus getting elements by class name:
function get_web_page( $url )
{
    $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    $options = array(
        CURLOPT_CUSTOMREQUEST  => "GET",        // set request type post or get
        CURLOPT_POST           => false,        // set to GET
        CURLOPT_USERAGENT      => $user_agent,  // set user agent
        CURLOPT_COOKIEFILE     => "cookie.txt", // set cookie file
        CURLOPT_COOKIEJAR      => "cookie.txt", // set cookie jar
        CURLOPT_RETURNTRANSFER => true,         // return web page
        CURLOPT_HEADER         => false,        // don't return headers
        CURLOPT_FOLLOWLOCATION => true,         // follow redirects
        CURLOPT_ENCODING       => "",           // handle all encodings
        CURLOPT_AUTOREFERER    => true,         // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,          // timeout on connect
        CURLOPT_TIMEOUT        => 120,          // timeout on response
        CURLOPT_MAXREDIRS      => 10,           // stop after 10 redirects
    );
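    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err = curl_errno( $ch );
    $errmsg = curl_error( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );

    // Stop if the request failed; there is no HTML to parse in that case
    if ( $content === false ) {
        echo "cURL error $err: $errmsg";
        return;
    }

    libxml_use_internal_errors( true ); // suppress warnings from malformed HTML
    $dom = new DOMDocument();
    $dom->loadHTML( $content );
    $finder = new DomXPath( $dom );
    $classname = "CLASS_NAME";
    // Note: contains() is a substring test, so "CLASS_NAME_2" would match as well
    $nodes = $finder->query( "//*[contains(@class, '$classname')]" );
    foreach ( $nodes as $key => $ele ) {
        print_r( $ele->nodeValue );
    }
}
get_web_page('DATA_SCRAP_URL_GOES_HERE');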
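The question also asks for each review's rating value, which means pairing two pieces of data per review. A minimal sketch, assuming each review sits in its own container element with the rating next to the comment: the container class review, the itemprop="ratingValue" span and the sample HTML are assumptions for illustration, not Yelp's actual markup, so inspect the real page source and adjust the queries.

```php
<?php
// Made-up sample markup; the real page structure will differ.
$sample = '<html><body>'
    . '<div class="review"><span itemprop="ratingValue">4.0</span>'
    . '<p class="review_comment ieSucks" itemprop="description" lang="en"> Great pasta </p></div>'
    . '<div class="review"><span itemprop="ratingValue">2.0</span>'
    . '<p class="review_comment ieSucks" itemprop="description" lang="en"> Slow service </p></div>'
    . '</body></html>';

libxml_use_internal_errors(true);   // tolerate invalid real-world HTML
$dom = new DOMDocument();
$dom->loadHTML($sample);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$reviews = [];
// Select every container that has "review" as a whole class token.
$boxes = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' review ')]"
);
foreach ($boxes as $box) {
    // The leading "." makes each query relative to the current container,
    // so a rating is never paired with another review's comment.
    $rating  = $xpath->evaluate("string(.//*[@itemprop='ratingValue'])", $box);
    $comment = $xpath->evaluate("string(.//p[@itemprop='description'])", $box);
    $reviews[] = [
        'rating'  => (float) trim($rating),
        'comment' => trim($comment),
    ];
}

print_r($reviews);
```

Querying relative to each container, rather than collecting all ratings and all comments in two separate lists, keeps the pairs aligned even when a review is missing one of the two fields.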