How to use PHP's strip_tags function to get only the content I need

Asked: 2014-11-06 05:37:32

Tags: php strip-tags

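I am using the strip_tags function to get the content I need, but it pulls in all the data from the link. Here is the sample code I use to fetch content from a link: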

<?php

$a=fopen("http://example.com/","r");
$contents=stream_get_contents($a);
fclose($a);
$contents1=strtolower($contents);

$start='<div id="content">';

$start_pos=strpos($contents1,$start);
$first_trim=substr($contents1,$start_pos);

$stop='</div><!-- content -->';
$stop_pos=strpos($first_trim,$stop);

$second_trim=substr($first_trim,0,$stop_pos+6);
$second_trim = strip_tags($second_trim, '<div><table><tbody><tr><td><a><h2><h4>');
echo "<div>$second_trim</div>";
?> 

This is the HTML that ends up in $second_trim:

<div><div id="content">
<div id="issuedescription"></div>
    <h2 class="wsite-content-title" style="text-align:center;">download content<br /><font     color="#f30519">table of content</font><br />&nbsp;<font color="#f80117"> content&nbsp;</font></h2>

    <h2>table of contents</h2>   
<h4 class="tocsectiontitle">editorial</h4>
<h2 class="wsite-content-title" style="text-align:left;">technical note</h2>        
<table class="tocarticle" width="100%">
<tr valign="top">           
<td class="toctitle" width="95%" align="left"><a     href="http://example.com/">where are we at and where are we heading to?</a>            </td>
    <td class="tocgalleys" width="5%" align="left">
                                <a href="http://example.com/"     class="file">pdf</a>                                          
</td>
</tr>
<tr>
<td class="tocauthors" width="95%" align="left">
                                sergio eduardo de paiva gonçalves                      </td>
    <td class="tocpages" width="5%" align="left">1-2</td>
</tr>
</table>
<div class="separator"></div>
<h4 class="tocsectiontitle">some text here</h4>

<table class="tocarticle" width="100%">
<tr valign="top">

    <td class="toctitle" width="95%" align="left"><a     href="http://example.com/">some text here</a></td>
    <td class="tocgalleys" width="5%" align="left">
                                <a href="http://example.com/"     class="file">pdf</a>

            </td>
</tr>
    <tr>
<td class="tocauthors" width="95%" align="left">
                                some text here,                         some text here,                         some text here,                         some text here,                         some text here,                         some text here                      </td>
    <td class="tocpages" width="5%" align="left">3-10</td>
</tr>
</table>
    <a target="_blank" rel="license" href="http://example.com/">    
    </a>
    some text here<a rel="license" target="_blank" href="http://example.com/">example</a>.
    </div></div> 

Now my problem is that I want to use only the strip_tags function to keep one specific tag out of the following two anchors:

<a href="http://example.com/" class="file">pdf</a>
<a href="http://example.com/">some text here</a>

and the second heading out of these two:

<h2 class="wsite-content-title" style="text-align:center;">download content<br /><font color="#f30519">table of content</font><br />&nbsp;<font color="#f80117"> content&nbsp;</font></h2>

<h2>table of contents</h2>

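But strip_tags either keeps all of them or removes all of them. How can I make it tell them apart, so that I get only the tag I want instead of every tag of the same kind? If there is a better way, please share your thoughts here.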
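The all-or-nothing behavior described above can be reproduced with a minimal sketch. The two-anchor string is a hypothetical fragment modeled on the output shown earlier, and the variable names `$html`, `$kept`, and `$stripped` are illustrative only. strip_tags matches on tag name alone, so it cannot tell the two anchors apart by their attributes:

```php
<?php
// Hypothetical two-anchor fragment modeled on the output above.
$html = '<a href="http://example.com/" class="file">pdf</a> '
      . '<a href="http://example.com/">some text here</a>';

// Allowing <a> keeps both anchors, attributes included.
$kept = strip_tags($html, '<a>');

// Not allowing <a> removes both anchors and leaves only their text.
$stripped = strip_tags($html);

echo $kept, "\n";
echo $stripped, "\n";
```

Because the allow-list only accepts tag names, selecting one anchor by its class needs a second step after strip_tags, or a different tool altogether.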

1 Answer:

Answer 0 (score: 0)

A regular expression can do something like this:

function handle_link($data) {
    list($link, $attributes, $content) = $data;
    $classes = preg_match('#class=[\'"]([^\'"]+)[\'"]#', $attributes, $match) ? preg_split('#\s+#', $match[1]) : array();
    // If the link has the "file" class
    if(in_array('file', $classes)) {
        return $content; // only the internal content (like strip_tags would do)
        // or you can return a new link:
        // return '<a href="myfile" class="myclass">' . $content . '</a>';
    } else {
        return $link; // any other link is returned unchanged
    }
}

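// Keep <a> in the allowed list; otherwise strip_tags removes every link
// before the callback below gets a chance to inspect it.
$second_trim = strip_tags($second_trim, '<div><table><tbody><tr><td><a><h2><h4>');
// U makes the quantifiers lazy; s lets "." match newlines inside a link.
$second_trim = preg_replace_callback('#<a\b([^>]*)>(.*)</a>#Us', 'handle_link', $second_trim);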
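
Since the question also asks for a better way, here is a sketch of an alternative using PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of a regular expression. It is not part of the original answer. It assumes the goal is to keep the links without the "file" class and the `<h2>` that has no class attribute. The sample markup and the names `$html`, `$links`, and `$headings` are hypothetical.

```php
<?php
// Hypothetical fragment modeled on the question's output.
$html = <<<HTML
<div id="content">
<h2 class="wsite-content-title">download content</h2>
<h2>table of contents</h2>
<table><tr>
<td><a href="http://example.com/">some text here</a></td>
<td><a href="http://example.com/" class="file">pdf</a></td>
</tr></table>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Anchors whose class list does not contain the token "file".
$links = array();
$query = '//a[not(contains(concat(" ", normalize-space(@class), " "), " file "))]';
foreach ($xpath->query($query) as $a) {
    $links[] = $doc->saveHTML($a);
}

// <h2> elements that carry no class attribute at all.
$headings = array();
foreach ($xpath->query('//h2[not(@class)]') as $h2) {
    $headings[] = $doc->saveHTML($h2);
}

print_r($links);
print_r($headings);
```

An XPath query selects elements by attribute, which is the distinction strip_tags cannot make. To keep the "pdf" links instead, the `not(...)` wrapper in `$query` can be dropped.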