Web scraping using preg_match_all

Date: 2013-01-15 01:41:16

Tags: php html-parsing web-scraping

I'm trying to get the contact information from this site, http://www.internic.net/registrars/registrar-967.html, using PHP. I'm able to get the email address by using the href link:

$contactStr = "http://www.internic.net/registrars/registrar-967.html";
$contact_string = file_get_contents($contactStr);
preg_match_all('/<a href="(.*)">(.*)<\/a>/i', $contact_string, $contactInfo);
$email = str_replace("mailto:", "", $contactInfo[1][6]);

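However, I'm having trouble getting the address and phone number, since there is no HTML element I can target, except maybe <p>. I only need "1800 SW First Ave., Suite 440 Portland OR 97201 United States" and "310-467-2549" from this site. Please enlighten me on how to do this using preg_match_all or some other method. Thanks!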

1 answer:

Answer 0 (score: 0)

As others have said in the comments, try DOMDocument instead of a regular expression.
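As a minimal sketch of that suggestion, the email lookup from the question could be done with DOMDocument and DOMXPath rather than a regex. The inline HTML below is a hypothetical stand-in for the registrar page (the real markup may differ), so the snippet runs without a network request:

```php
<?php
// Hypothetical inline HTML standing in for the registrar page (an assumption,
// not the real markup), so the sketch runs without fetching anything.
$html = '<html><body><table><tr><td width="420">'
      . 'Example Registrar, Inc.<br>1 Example St.<br>'
      . '<a href="mailto:ops@example.com">ops@example.com</a>'
      . '</td></tr></table><a href="/index.html">Home</a></body></html>';

// Collect parser warnings from imperfect HTML instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select only the anchors whose href starts with "mailto:".
$xpath = new DOMXPath($dom);
$emails = array();
foreach ($xpath->query('//a[starts-with(@href, "mailto:")]') as $a) {
    $emails[] = substr($a->getAttribute('href'), strlen('mailto:'));
}

print_r($emails);
```

Selecting by the `mailto:` prefix avoids depending on the position of the link in the page, which is what the hard-coded index `[1][6]` in the question relies on.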

Here is an example (a bit hacky, though); I hope it helps:

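function get_register_by_id($id){
    // Fetch the registrar page for the given numeric ID.
    $site = file_get_contents('http://www.internic.net/registrars/registrar-'.$id.'.html');
    $dom = new DOMDocument();
    // Suppress the warnings that malformed HTML would otherwise trigger.
    @$dom->loadHTML($site);
    $result = array();
    foreach($dom->getElementsByTagName('td') as $td) {
        // The contact block sits in the table cell with width="420".
        if($td->getAttribute('width')=='420'){
            // Rebuild the cell's inner markup; saveXML() serializes line breaks as <br/>.
            $innerHTML= '';
            $children = $td->childNodes;
            foreach ($children as $child) {
                $innerHTML .= trim($child->ownerDocument->saveXML($child));
            }
            // Split on <br/>, then trim each piece and strip any remaining tags.
            $fixed = array_map('strip_tags', array_map('trim', explode("<br/>",trim($innerHTML))));
            foreach($fixed as $val){
                // Skip empty lines.
                if(empty($val)){continue;}

                $result[] = str_replace(array('! '),'',$val);
            }
        }
    }
    return $result;
}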


print_r(get_register_by_id(965));
/*Array
(
    [0] => Domain Central Australia Pty Ltd.
    [1] => Level 27
    [2] => 101 Collins Street
    [3] => Melbourne Victoria 3000
    [4] => Australia
    [5] => +64 300 4192
    [6] => robert.rolls@domaincentral.com.au
)*/
print_r(get_register_by_id(966));
/*
Array
(
    [0] => Web Business, LLC
    [1] => PO Box 1417
    [2] => Golden CO 80402
    [3] => United States
    [4] => +1.303.524.3469
    [5] => support@webbusiness.biz
)*/

print_r(get_register_by_id(967));
/*
Array
(
    [0] => #1 Host Australia, Inc.
    [1] => 1800 SW First Ave., Suite 440
    [2] => Portland OR 97201
    [3] => United States
    [4] => 310-467-2549
    [5] => registry-operations@moniker.com
)*/
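
Since the question only asks for the address and phone number, the flat array can be split into named fields. The helper below is a hypothetical addition, not part of the answer, and it assumes the layout seen in the three sample outputs: name first, email last, phone second to last, and everything in between the address. Pages that deviate from that layout would need extra checks.

```php
<?php
// Hypothetical helper (not part of the answer): split the flat list returned
// by get_register_by_id() into named fields. Assumes the layout seen in the
// sample outputs: name first, email last, phone second to last.
function split_contact_lines(array $lines) {
    $email = array_pop($lines);   // last entry
    $phone = array_pop($lines);   // entry before the email
    $name  = array_shift($lines); // first entry
    return array(
        'name'    => $name,
        'address' => implode(', ', $lines), // whatever remains in the middle
        'phone'   => $phone,
        'email'   => $email,
    );
}

// Sample data copied from the output for registrar 967 above.
$lines = array(
    '#1 Host Australia, Inc.',
    '1800 SW First Ave., Suite 440',
    'Portland OR 97201',
    'United States',
    '310-467-2549',
    'registry-operations@moniker.com',
);

$contact = split_contact_lines($lines);
echo $contact['address'], "\n";
echo $contact['phone'], "\n";
```

In real use, `$lines` would come from `get_register_by_id(967)` instead of the hard-coded array.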