Scraping content from a web page

Time: 2014-12-05 18:06:04

Tags: php html scrape

I am trying to scrape content from this web page: www.motorcyclemonster.com/motorcycle-events.html. The code I am using is:

<?php  

    $content = file_get_contents('http://www.motorcyclemonster.com/motorcycle-events.html');

    $pattern = '#<tr.">\r\n<td>(.*)</td>\r\n<td>(.*)</td>#';

    preg_match_all ($pattern, $content, $data);

    var_dump($data);

    for ($i = 0; $i < 11; $i++)
    {
        echo "<br /><br />". $data[2][$i].' '.$data[3][$i];
    }

?> 

I would like to be able to extract the information from rows like the following:

<tr>
    <td width="23%" bgcolor="#76C2FA">Nov 15 - Jan 4</td>
    <td width="52%" bgcolor="#76C2FA"><b> <a href="/events/cars-and-christmas-2014-11-15-Hershey-PA.html" title="Cars and Christmas - Hershey, Pennsylvania">Cars and Christmas</a></b></td>
    <td width="20%" bgcolor="#76C2FA">Hershey</td>
    <td width="5%" bgcolor="#76C2FA">PA</td>
</tr>

and to set some variables from it, like this:

<tr>
    <td width="23%" bgcolor="#76C2FA">**$date**</td>
    <td width="52%" bgcolor="#76C2FA"><b><a href="**$page_url**" title="**$title**">$title</a></b></td>
    <td width="20%" bgcolor="#76C2FA">**$city**</td>
    <td width="5%" bgcolor="#76C2FA">**$state**</td>
</tr>

Any help would be greatly appreciated.

2 Answers:

Answer 0: (score: 0)

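Parsing HTML markup with regular expressions is not a good idea, for a variety of reasons: the pattern breaks as soon as attributes, whitespace, or line endings differ from what it expects. I would suggest looking at the PHP DOM extension instead, since it was built for exactly this purpose.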
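
For what it is worth, here is a minimal sketch of how the DOM extension could handle the row from the question. It uses DOMDocument and DOMXPath from PHP's standard DOM extension. The function name extractEvents and the array keys are my own choices for illustration, and the sample markup is just the row copied from the question wrapped in a table; for the live page you would pass in the fetched HTML instead, and the XPath expression may need tightening to match only the events table.

```php
<?php
// Sample markup: the row from the question, wrapped in a <table>.
// For the real page, pass the fetched HTML string instead.
$sample = <<<'HTML'
<table>
<tr>
    <td width="23%" bgcolor="#76C2FA">Nov 15 - Jan 4</td>
    <td width="52%" bgcolor="#76C2FA"><b> <a href="/events/cars-and-christmas-2014-11-15-Hershey-PA.html" title="Cars and Christmas - Hershey, Pennsylvania">Cars and Christmas</a></b></td>
    <td width="20%" bgcolor="#76C2FA">Hershey</td>
    <td width="5%" bgcolor="#76C2FA">PA</td>
</tr>
</table>
HTML;

/**
 * Hypothetical helper (not from the question): returns one array per
 * table row that has exactly four cells and a link in the second cell.
 */
function extractEvents($html)
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $events = array();
    foreach ($xpath->query('//tr[count(td) = 4]') as $row) {
        $cells = $xpath->query('td', $row);
        $link = $xpath->query('.//a[@href]', $cells->item(1))->item(0);
        if ($link === null) {
            continue; // no event link in this row, so skip it
        }
        $events[] = array(
            'date'     => trim($cells->item(0)->textContent),
            'page_url' => $link->getAttribute('href'),
            'title'    => trim($link->textContent),
            'city'     => trim($cells->item(2)->textContent),
            'state'    => trim($cells->item(3)->textContent),
        );
    }
    return $events;
}

$events = extractEvents($sample);
print_r($events);
```

Each entry then carries the date, page URL, title, city, and state that the question wanted as variables, without depending on the exact whitespace or attributes of the markup.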

Answer 1: (score: 0)

The simplest approach is to use the PHP Simple HTML DOM Parser:

<?php
/**
 * Created by PhpStorm.
 * User: Adrian
 * Date: 05/12/2014
 * Time: 19:28
 */
//Load website

include('simple_html_dom.php');
$html = file_get_html('http://www.motorcyclemonster.com/motorcycle-events.html');

//For each row of the third table on the page (index 2)
$events = array();
foreach($html->find('table',2)->find('tr') as $h){
    $cells = $h->find('td');
    if(count($cells) < 4) {
        continue; //Skip header or spacer rows that lack the four data cells
    }
    $temp = array();
    //get date
    $temp['date'] = $cells[0]->innertext; //Inner contents of first cell

    if($url = $cells[1]->find('a', 0)) {//First link of second cell
        $temp['url'] = $url->href; //href attribute
        $temp['url_title'] = $url->title; //title attribute
        $temp['title'] = $url->innertext; // Inner content of link

        $temp['town'] = $cells[2]->innertext;
        $temp['state'] = $cells[3]->innertext;

        $events[] = $temp;
    }
}

print_r($events);

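Note: you need to include the Simple HTML DOM Parser first (the simple_html_dom.php file).

You can then loop over this events array to display the events however you like.

As a common courtesy, I would only show the first page or two of events and link back to the source site for the full list.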
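
A rough sketch of that display loop, assuming the array keys built above (date, url, url_title, title, town, state). The stand-in data is the sample row from the question, the $limit of 10 is an arbitrary cap I picked for illustration, and $source, $lines, and the e() helper are hypothetical names. Since innertext can contain markup, you may want to strip tags before escaping on the real page.

```php
<?php
// Stand-in data in the shape of the events array; values come from the
// sample row in the question.
$events = array(
    array(
        'date'      => 'Nov 15 - Jan 4',
        'url'       => '/events/cars-and-christmas-2014-11-15-Hershey-PA.html',
        'url_title' => 'Cars and Christmas - Hershey, Pennsylvania',
        'title'     => 'Cars and Christmas',
        'town'      => 'Hershey',
        'state'     => 'PA',
    ),
);

$source = 'http://www.motorcyclemonster.com'; // site the relative links belong to
$limit  = 10; // assumed cap, chosen only for illustration

// Hypothetical helper: escape a value for HTML output without
// double-encoding entities that are already present.
function e($value)
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8', false);
}

$lines = array();
foreach (array_slice($events, 0, $limit) as $event) {
    // Plain variables like the ones the question asked for.
    $date     = $event['date'];
    $page_url = $source . $event['url'];
    $title    = $event['title'];
    $city     = $event['town'];
    $state    = $event['state'];

    $lines[] = sprintf(
        '%s: <a href="%s" title="%s">%s</a> (%s, %s)',
        e($date), e($page_url), e($event['url_title']), e($title), e($city), e($state)
    );
}

// Link back to the source site for the full list.
$lines[] = '<a href="' . e($source . '/motorcycle-events.html') . '">Full list at the source site</a>';

echo implode("<br />\n", $lines);
```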