How to extract data from HTML using PHP

Time: 2015-02-25 10:34:38

Tags: php html

I have this HTML code.

<td valign="top" style="padding:3px">

    <p>

    <b>Release Year: </b>2005

    <br />

    <b>
    Genre: 
    <a href=/genres/Animation>Animation</a>, 
    <a href=/genres/Comedy>Comedy</a>
    </b>

    <br />

    <b>External Links: </b> 
    <a href="http://www.imdb.com/title/tt0397306/" target="_blank">IMDB</a> 
    <br />

    <b>No. of episodes: </b> 23 episodes 

    <br />

    <b>Latest Episode With Links: </b> 

    <a title="Watch American Dad! Latest Episode (American Dad! Season 1 Episode 23)" href="/episode/american_dad_s1_e23.html">
    Season 1 Episode 23 Tears of a Clooney (14/05/2006)
    </a>

    <br />

    <div style="float: left; height: 30px; overflow: hidden; width: 100px;">

    <div class="fb-like" data-href="http://watchseries.ag/season-1/american_dad" data-send="false" data-layout="button_count" data-show-faces="false"></div>

    </div>

    <a href="https://twitter.com/share" class="twitter-share-button" data-url="http://watchseries.ag/season-1/american_dad">Tweet</a>

    <script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

    <br clear="all" />

    <b>Description :</b> The random escapades of Stan Smith, an extreme right wing CIA agent dealing with family life and keeping America safe, all in the most absurd way possible.<br>

    </p>
    </td>

Using PHP, I want to extract only some of the information from the HTML above.

I need just three things: 1. Release Year 2. IMDb link 3. Genre

array(
     'release_year'=>2005, 
     'imdb_link'=>'http://www.imdb.com/title/tt0397306/',
     'genre'=> array(
                     'Animation',
                     'Comedy',
                    )
     );

I also wrote a PHP function that filters the HTML and returns an array, but it does not produce the array shown above. Instead, it gives me this result:

Output

Array
(
    [release year] => 2005Genre
)
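
A likely cause of the `2005Genre` value is that the function's default delimiter is the literal string `<br>`, while the markup uses `<br />`, so the rows are never separated before `strip_tags()` joins them together. The following is a minimal sketch of that effect, using a shortened fragment invented for illustration (the variable names are not from the original code), alongside a `preg_split()` pattern that accepts either form of the tag:

```php
<?php
// Shortened, hypothetical fragment in the same style as the markup above.
$html = '<b>Release Year: </b>2005<br /><b>Genre: </b>Animation<br />';

// The literal delimiter '<br>' never occurs, so everything stays in one chunk.
$byLiteral = explode('<br>', $html);

// A pattern that matches <br>, <br/>, <br /> and <br clear="all" />.
$byPattern = preg_split('~<br\b[^>]*>~i', $html, -1, PREG_SPLIT_NO_EMPTY);

// With a single chunk, strip_tags() glues the rows together, and splitting
// on ':' leaves the next label attached to the value.
$parts = explode(':', strip_tags($byLiteral[0]));
$firstValue = trim($parts[1]);
```

Here `$byLiteral` holds one element and `$byPattern` holds two, one per row. Splitting on `:` remains fragile even then, because URLs such as `http://...` also contain a colon.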

Function

function do_html_array($td,$dlm='<br>'){
    if(!empty($td)){
        $td = html_entity_decode($td);
        $td = preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $td);
        $html_array = explode($dlm,$td);
        $html_key_array = array();
        foreach($html_array as $key=>$html){
                $html = explode(':',trim(strip_tags($html)));
                if(trim($html[0])!=''){
                    if(count($html)<1) $html[1] = '';                   
                    if(strtolower(trim($html[0]))=='description') $html[1] = str_ireplace('[+]more','',$html[1]);
                    $html_key_array[strtolower(trim($html[0]))] = trim($html[1]);
                    switch(trim(strtolower($html[0]))){
                        case'external links':
                             preg_match_all('~<a\s+.*?</a>~is',$html_array[$key],$html_key_array['imdb_link']);                          
                        break;
                        case'genre':
                             preg_match_all('~<a\s+.*?</a>~is',$html_array[$key],$html_key_array['genre_link']);                             
                        break;
                        // further define here...
                    }
                }
        }
        return $html_key_array;
    }
    return false; 
}
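
For comparison, the three values could probably be read more reliably with PHP's DOM extension than with string splitting. The sketch below is one possible approach, not a fix of the function above: `extract_show_info()` is a hypothetical name, and the XPath expressions assume that the release year is the text directly after the `<b>Release Year: </b>` label, that genre links start with `/genres/`, and that the IMDb link contains `imdb.com/title/`.

```php
<?php
// Hypothetical helper: pulls release year, IMDb link and genres from the fragment.
function extract_show_info(string $html): array
{
    $result = ['release_year' => null, 'imdb_link' => null, 'genre' => []];
    if (trim($html) === '') {
        return $result; // loadHTML() rejects an empty string
    }

    // Collect parser warnings (unquoted attributes, stray <td>) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Assumption: the year is the first text node following the "Release Year" label.
    $yearNode = $xpath->query(
        "//b[starts-with(normalize-space(.), 'Release Year')]/following-sibling::text()[1]"
    )->item(0);
    if ($yearNode !== null) {
        $result['release_year'] = (int) trim($yearNode->nodeValue);
    }

    // Assumption: the IMDb link is the first anchor pointing at imdb.com/title/.
    $imdbNode = $xpath->query("//a[contains(@href, 'imdb.com/title/')]")->item(0);
    if ($imdbNode instanceof DOMElement) {
        $result['imdb_link'] = $imdbNode->getAttribute('href');
    }

    // Assumption: every genre is an anchor whose href starts with /genres/.
    foreach ($xpath->query("//a[starts-with(@href, '/genres/')]") as $anchor) {
        $result['genre'][] = trim($anchor->textContent);
    }

    return $result;
}
```

Calling `extract_show_info($td)` with the `<td>` markup from the question should then yield an array with the keys `release_year`, `imdb_link` and `genre`, with any value that cannot be found left as `null` or an empty array.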

0 answers:

No answers yet