Regex PHP - Find a substring inside <div> </div> tags

Time: 2011-10-18 23:43:10

Tags: regex wordpress html substring

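First of all: I know I shouldn't use regex to parse HTML. I've read that many times. But the tool I have to work with is regex-based, so I can't use an HTML parser or anything else. I appreciate all the concern, but if what I need can be done with regex, great. If not, sorry, we'll just have to drop this feature.

The problem is:

Short explanation: I need a regex that returns a substring contained inside a tag in a PHP-generated web page (WordPress, in case that matters).

Long explanation: I need to find every instance of a game's name (in this example, Batman: Arkham City) that sits inside the various <div class="post-bodycopy clearfix"> elements on my page. That means I only want the game name in the post body, not in the post title, the sidebar, or anywhere else. I will then replace this name using preg_replace or something similar.

I searched the web for a similar problem, but I could only find "give me everything inside the tags" questions.

Here is a typical post from my generated code: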

<div class="post-268445 post hentry category-world-community-gamer tag-games tag-geral tag-lancamentos tag-noticias tag-pc tag-ps3 tag-xb360" id="post-268445">
<div class="post-kicker"><?php get_cat_icon(); ?><a href="http://www.gameblogs.com.br/category/world-community-gamer/" title="World Community Gamer" onclick="return TrackClick('http://www.gameblogs.com.br/category/world-community-gamer/','')"><img src="http://www.gameblogs.com.br/wp-content/uploads/world-community-gamer.png" width="48" height="48" alt="" title="World Community Gamer" /></a></div>
<div class="post-headline">     <h2>    <a href="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc" rel="bookmark" title="Permanent Link to Data para Batman: Arkham City no PC" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc','')">Data para Batman: Arkham City no PC</a></h2>   </div>
<div class="post-byline"><img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/user.gif" alt="" /> <a href="http://www.gameblogs.com.br/author/_otaviofqueiroz/" title="Posts de @_otaviofqueiroz" onclick="return TrackClick('http://www.gameblogs.com.br/author/_otaviofqueiroz/','')">@_otaviofqueiroz</a>, do <img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/home.gif" alt="" /> <a href="http://www.worldcommunitygamer.com/" target="_blank" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/','')">WCG | World Community Gamer: Jogos, Análises e Tecnologia</a>, <img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/calendar_month.png" alt="" /> 18/10/11 | Compartilhe: <a href="http://twitter.com/share" class="twitter-share-button" data-url="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc" data-text="WCG | World Community Gamer: Jogos, Análises e Tecnologia: Data para Batman: Arkham City no PC" data-count="horizontal" data-via="GameBlogsBR" data-lang="fr" target="_blank" onclick="return TrackClick('http://twitter.com/share','')">Tweet</a><script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script></div><div class="post-bodycopy clearfix"><p> <a href="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html" imageanchor="1" style="margin-left: 1em; margin-right: 1em;" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html','')"><img src="/wp-content/plugins/wordpress-image-resizer/thumb/phpThumb.php?fltr=usm&#038;src=http://2.bp.blogspot.com/-9oKlgIND3qY/Tp3Aimju2nI/AAAAAAAABxA/Q585nqpdsRI/s1600/batman_arkham_city_screens16-620x348.jpg&#038;w=200" align='left'></a>
<p>A Warner divulgou a data de lançamento para Batman: Arkham City no PC. O jogo que terá a sua versão para os consoles (PS3 e Xbox 360) lançada nessa sexta-feira, chegará as lojas na versão PC no dia 18 de Novembro. Apesar da demora [...]<br /><a href=http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&#038;utm_campaign=data-para-batman-arkham-city-no-pc>[continua no site original...]</a></p></div>
<div class="post-footer"><img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/tag.gif" alt="" /> <a href="http://www.gameblogs.com.br/tag/games/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/games/','')">Games</a>, <a href="http://www.gameblogs.com.br/tag/geral/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/geral/','')">Geral</a>, <a href="http://www.gameblogs.com.br/tag/lancamentos/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/lancamentos/','')">lançamentos</a>, <a href="http://www.gameblogs.com.br/tag/noticias/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/noticias/','')">Notícias</a>, <a href="http://www.gameblogs.com.br/tag/pc/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/pc/','')">PC</a>, <a href="http://www.gameblogs.com.br/tag/ps3/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/ps3/','')">PS3</a>, <a href="http://www.gameblogs.com.br/tag/xb360/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/xb360/','')">XB360</a><br>Todos os posts do <a href="http://www.gameblogs.com.br/category/world-community-gamer/" onclick="return TrackClick('http://www.gameblogs.com.br/category/world-community-gamer/','')">World Community Gamer</a></div></div><!-- / Post -->

I have tried the following for the search:

$<div class\=\"post-bodycopy clearfix\">(.+?)(Batman: Arkham City)(.+?)(?=<div class\=\"post-footer\">)$s

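Meaning: find the opening div tag, then anything, then Batman: Arkham City, followed by anything up to the opening div tag of the post footer, treating the input as multi-line.

And the following for the replacement: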

<div class="post-bodycopy clearfix">/1<a href="http://www.mylink">Batman: Arkham City</a>/3

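For some reason, the regex works at http://regexlib.com, returning all the expected parts, but not on my live site. It must be something small.

However, I'm sure my solution isn't the most elegant (or the cheapest in server resources) way to find such a substring, since I capture several parts just to change one of them.

Is there a smarter way to achieve this? Please?

Thanks a lot!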

3 answers:

Answer 0: (score: 0)

$title = 'Batman: Arkham City';

Search: {(?<=<div class="post-bodycopy clearfix">)(.+?)($title)(.+?)(?=<div class="post-footer">)}s

Replace with one of:
\1<a href="http://www.mylink">\2</a>\3

$1<a href="http://www.mylink">$2</a>$3
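A minimal sketch of how the search and replace above could be wired into preg_replace(). The function name link_first_title() and the sample $html are assumptions made for illustration, and preg_quote() is added so the title is matched literally. Because the pattern consumes everything up to the post footer, it links only the first occurrence of the title in each post body.

```php
<?php
// Hypothetical helper (not from the answer): links the first occurrence
// of $title found between the post-bodycopy DIV and the post-footer DIV.
function link_first_title(string $html, string $title, string $link): string
{
    // Lookbehind anchors the match just after the body DIV's opening tag;
    // lookahead stops it just before the footer DIV. The "s" flag lets "."
    // match newlines.
    $pattern = '{(?<=<div class="post-bodycopy clearfix">)(.+?)('
        . preg_quote($title)
        . ')(.+?)(?=<div class="post-footer">)}s';
    $replacement = '$1<a href="' . $link . '">$2</a>$3';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, $replacement, $html) ?? $html;
}

// Assumed sample input: the title appears in the headline and in the body.
$html = '<h2>Batman: Arkham City</h2>'
    . '<div class="post-bodycopy clearfix"><p>Play Batman: Arkham City now.</p></div>'
    . '<div class="post-footer">tags</div>';

echo link_first_title($html, 'Batman: Arkham City', 'http://www.mylink'), "\n";
```

The headline is left alone because the lookbehind only succeeds right after the body DIV's opening tag. If a post body does not contain the title, the lazy (.+?) can run on into a later post, so this is best treated as a starting point rather than a complete solution.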

Edit
You can try the following. A sample PHP script is at http://ideone.com/JtH4s

$title = 'Batman: Arkham City';
$divclass = 'post-bodycopy clearfix';

$rxtag =
'<
 (?:
     \?php\s+.*?\?
  |  (?:
       (?:
           (?:script|style)\s*
         | (?:script|style)\s+(?:".*?"|\'.*?\'|[^>]*?)+\s*
       )> .*? </(?:script|style)\s*
     )
  |  (?:
         /?[A-Za-z_:][\w:.-]*\s*/?
       |  [A-Za-z_:][\w:.-]*\s+(?:".*?"|\'.*?\'|[^>]*?)+\s*/?
       | !(?:DOCTYPE.*?|--.*?--)
     )
 )
 >
';

// Or,
// $rxtag_optional = '<[^<>]+?>';
// $rxtag = $rxtag_optional;



$rxmain =
"~(?xs:
   ( <div (?=\\s)[^>]*
          (?<=\\s) class \\s* = \\s* \" \\s* (?i-x:$divclass) \\s* \"
          [^>]* (?<!/)
     >
     (?:
         (?! </?div | (?-x:$title))
         (?> $rxtag  | [^<] | <)
     )*?
   )
   ( (?-x:$title) )
   (
      (?: (?!</?div) (?> $rxtag  | [^<] | <) )*?
      </div \\s*>
   )
 )
~";

//print "$rxmain\n\n";

$count = 0;

$newhtml = preg_replace( $rxmain,
                         "$1<a href=\"http://www.mylink\">$2</a>$3",
                         $html,
                         1,
                         $count );

Answer 1: (score: 0)

I put together an example here, in PHP, using the following regex:

'|(<div class="post-bodycopy clearfix">)(.*?)(Batman: Arkham City)(.*?)(</div>)|e'

I added a Batman: Arkham City at the bottom of the HTML string, just for testing. It seems to work. Let me know.
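A note for current PHP versions: the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so a pattern like the one above needs preg_replace_callback() instead. The sketch below keeps the same five capture groups but makes three changes that are assumptions of this example, not part of the answer: it adds the s flag so "." can cross line breaks, switches the delimiter to "~", and wraps everything in a hypothetical function named link_title_in_body().

```php
<?php
// Hypothetical helper: same capture groups as the answer's pattern, driven
// by a callback instead of the removed "e" modifier.
function link_title_in_body(string $html, string $title, string $link): string
{
    $re = '~(<div class="post-bodycopy clearfix">)(.*?)('
        . preg_quote($title, '~')
        . ')(.*?)(</div>)~s';

    $result = preg_replace_callback(
        $re,
        function (array $m) use ($link): string {
            // $m[3] is the matched title; wrap it and re-emit the rest as is.
            return $m[1] . $m[2]
                . '<a href="' . $link . '">' . $m[3] . '</a>'
                . $m[4] . $m[5];
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}
```

As with the original pattern, this links only the first occurrence per DIV, assumes the body DIV has no nested DIVs, and can run past the closing tag when the title is missing from a body.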

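Answer 2: (score: 0)

If you insist on using regex, and your <div class="post-bodycopy clearfix">...</div> elements will never contain any nested DIVs, then here is a double-callback solution that should do a pretty good job: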

// Linkify title inside post-bodycopy DIV text.
function p($text) {
    global $title, $link;
    // Set title to be found and linkify URL address.
    $title = 'Batman: Arkham City';
    $link = 'http://www.mylink';
    // Match non-nested "post-bodycopy" class DIV element.
    $re = '%<div class="post-bodycopy clearfix">(.+?)</div>%si';
    return preg_replace_callback($re, 'p_cb', $text);
}
function p_cb($matches) {
    // Match tag (in $1) and non-tag stuff (in $2).
    $re = '%
          ( </?\w+   # Either $1: An open or close tag.
            (?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s<>]+))?)*
            \s*/?>
          )
        | ( [^<]+ )  # Or $2: Non-tag stuff.
        %x';
    $matches[1] = preg_replace_callback($re, 'p_cb_cb', $matches[1]);
    return '<div class="post-bodycopy clearfix">'. $matches[1] .'</div>';
}
function p_cb_cb($matches) {
    global $title, $link;
    # Return open and close tags unchanged.
    if (isset($matches[1]) && $matches[1]) return $matches[1];
    # Process non-tag text, converting text to link.
    $matches[2] = str_replace(
        $title,
        '<a href="'. $link .'">'. $title .'</a>',
        $matches[2]);
    return $matches[2];
}

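The p() function processes the HTML file contents. Its regex matches each <div class="post-bodycopy clearfix">...</div> element and passes the DIV contents to the p_cb() callback function. That first callback then processes the DIV contents with a regex that matches either an opening or closing tag (captured in group $1) or non-tag content (captured in group $2). It in turn calls the second callback function, p_cb_cb(), which returns opening and closing tags (in $1) unchanged and uses str_replace() to convert every instance of the $title text in the non-tag content into the desired link.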
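The same two-stage flow can also be written with closures, so the title and link are passed as parameters rather than globals. This is only a sketch under stated assumptions: linkify_title() is a hypothetical name, and the tag pattern is deliberately simplified to <[^<>]+>, which would misfire on attribute values that contain ">".

```php
<?php
// Hypothetical closure-based variant of the double-callback approach.
function linkify_title(string $html, string $title, string $link): string
{
    // Stage 1: each non-nested post-bodycopy DIV (open tag, body, close tag).
    $divRe = '%(<div class="post-bodycopy clearfix">)(.+?)(</div>)%si';
    // Stage 2: either a tag (group 1) or a run of non-tag text (group 2).
    $tokenRe = '%(<[^<>]+>)|([^<]+)%';

    $result = preg_replace_callback(
        $divRe,
        function (array $div) use ($tokenRe, $title, $link): string {
            $body = preg_replace_callback(
                $tokenRe,
                function (array $tok) use ($title, $link): string {
                    // Tags are returned unchanged.
                    if ($tok[1] !== '') {
                        return $tok[1];
                    }
                    // Non-tag text: link every occurrence of the title.
                    return str_replace(
                        $title,
                        '<a href="' . $link . '">' . $title . '</a>',
                        $tok[2]
                    );
                },
                $div[2]
            );
            return $div[1] . ($body ?? $div[2]) . $div[3];
        },
        $html
    );

    return $result ?? $html;
}
```

Because only non-tag text reaches str_replace(), a title that appears inside an attribute value (for example an alt text) is left untouched, and anything outside the post-bodycopy DIV is never examined.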

Note that your HTML markup is invalid: many tag attribute values are unquoted (they should be quoted).