Remove all div tags from a string with PHP regex?

Time: 2012-03-07 18:31:12

Tags: php regex replace str-replace strip-tags

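I have a WYSIWYG editor on a site. The problem is that users copy and paste a lot of data into it, leaving many unclosed and improperly formatted div tags that break the site layout.

Is there an easy way to strip all occurrences of <div> and </div>?

str_replace won't work because some of the divs carry styling and other attributes, so it would need to account for <div style="some styling">, <div align="center">, etc.

I'm guessing this could be done with a regular expression, but I'm a total beginner when it comes to those.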

4 answers:

Answer 0: (score: 6)

It is better to use a DOM-based HTML parser, but if you have no choice other than regex, you can use it like this:

$patterns = array();
$patterns[0] = '/<div[^>]*>/';
$patterns[1] = '/<\/div>/';
$replacements = array();
$replacements[0] = '';
$replacements[1] = '';
echo preg_replace($patterns, $replacements, $html);
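
The two patterns above can probably be folded into one. The following is a minimal sketch, not part of the original answer: the function name `strip_div_tags_regex` is made up for illustration, and it assumes that no attribute value contains a literal `>` character. It also adds the `i` modifier so that upper-case tags such as `<DIV>` are caught.

```php
<?php
// Hypothetical helper name; a sketch only, not code from the answer above.
function strip_div_tags_regex(string $html): string
{
    // </?div  matches an opening or a closing div tag
    // \b      requires a word boundary right after "div"
    // [^>]*   consumes any attributes up to the closing bracket
    // i       makes the match case-insensitive (<DIV>, <Div>)
    $result = preg_replace('#</?div\b[^>]*>#i', '', $html);

    // preg_replace returns null on error; fall back to the input in that case.
    return $result ?? $html;
}

$html = '<div style="color:red"><p>Hello</p></DIV><div align="center">World';
echo strip_div_tags_regex($html), "\n"; // prints <p>Hello</p>World
```

Only the div tags themselves are removed; the content between them is kept, including content after an unclosed opening tag.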

Answer 1: (score: 0)

Here is a simple example of how to do it with PHP:

    <?php
    /**
     * Removes the divs because why not
     */
    function strip_divs(&$text, $id = 'html') {
      $replacements = array();
      worker($text, $replacements, $id);

      foreach ($replacements as $key => $val) {
        $text = mb_str_replace($key, $val, $text);
      }

      return $text;
    }

    function worker(&$body, &$replacements, $id) {
      static $call_count;
      if (empty($call_count)) {
        $call_count = array();
      }
      if (empty($call_count[$id])) {
        $call_count[$id] = 0;
      }

      if (mb_strpos($body, '</div>') !== FALSE) {
        $body = mb_str_replace('</div>', '', $body);
      }

      if (mb_strpos($body, '<di') !== FALSE) {
        $call_count[$id] ++;
        // Gets the important junk
        $rm               = '<di' . xml_get($body, '<di', '>') . '>';
        // Builds the replacements HTML
        $replacement_html = '';

        $next_id                       = count($replacements);
        $replacement_id                = "[[div-$next_id]]";
        $replacements[$replacement_id] = $replacement_html;

        $body = mb_str_replace($rm, $replacement_id, $body);

        if (mb_strpos($body, '<di') !== FALSE && $call_count[$id] < 200) {
          worker($body, $replacements, $id);
        }
      }
    }


    /**
     * Returns text by specifying a start and end point
     *
     * @param str $str
     *   The text to search
     * @param str $start
     *   The beginning identifier
     * @param str $end
     *   The ending identifier
     */
    function xml_get($str, $start, $end) {
      $str = "|" . $str . "|";
      $len = mb_strlen($start);
      if (mb_strpos($str, $start) > 0) {
        $int_start = mb_strpos($str, $start) + $len;
        $temp      = right($str, (mb_strlen($str) - $int_start));
        $int_end   = mb_strpos($temp, $end);
        $return    = trim(left($temp, $int_end));
        return $return;
      }
      else {
        return FALSE;
      }
    }

    function right($str, $count) {
      return mb_substr($str, ($count * -1));
    }

    function left($str, $count) {
      return mb_substr($str, 0, $count);
    }

    /**
     * Multibyte str replace
     */
    if (!function_exists('mb_str_replace')) {

      function mb_str_replace($search, $replace, $subject, &$count = 0) {
        if (!is_array($subject)) {
          $searches     = is_array($search) ? array_values($search) : array($search);
          $replacements = is_array($replace) ? array_values($replace) : array($replace);
          $replacements = array_pad($replacements, count($searches), '');
          foreach ($searches as $key => $search) {
            $parts   = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
          }
        }
        else {
          foreach ($subject as $key => $value) {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
          }
        }
        return $subject;
      }

    }

    $html = <<<HTML
    <table>
        <tbody>
            <tr>
                <td class="votecell">
                    <div class="vote">
                        <input type="hidden" name="_id_" value="9607101">
                        <a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>
                        <span itemprop="upvoteCount" class="vote-count-post ">0</span>
                        <a class="vote-down-off" title="This question does not show any research effort; it is unclear or not useful">down vote</a>
                        <a class="star-off" href="#">favorite</a>
                        <div class="favoritecount"><b></b></div>
                    </div>
                </td>
                <td class="postcell">
                    <div>
                        <div class="post-text" itemprop="text">
                            <p>I have a wysiwyg on a site. The problem is that the users are copy pasting a lot of data in to it leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout. </p>
                            <p>Is there an easy an easy way to strip all occurrences of <code>&lt;div&gt;</code> and <code>&lt;/div&gt;</code>?</p>
                            <p>str_replace won't work because some of the divs have styling and other things in them so it would need to account for <code>&lt;div style="some styling"&gt; &lt;div align="center"&gt;</code> etc</p>
                            <p>I'm guessing this could be done with a regular expression but I am total a total beginner when it comes to those. </p>
                            <p>Thanks a lot,
                                Martin
                            </p>
                        </div>
                        <div class="post-taglist">
                            <a href="/questions/tagged/php" class="post-tag js-gps-track" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/regex" class="post-tag js-gps-track" title="show questions tagged 'regex'" rel="tag">regex</a> <a href="/questions/tagged/replace" class="post-tag js-gps-track" title="show questions tagged 'replace'" rel="tag">replace</a> <a href="/questions/tagged/str-replace" class="post-tag js-gps-track" title="" rel="tag">str-replace</a> <a href="/questions/tagged/strip-tags" class="post-tag js-gps-track" title="show questions tagged 'strip-tags'" rel="tag">strip-tags</a>
                        </div>
                        <table class="fw">
                            <tbody>
                                <tr>
                                    <td class="vt">
                                        <div class="post-menu"><a href="/q/9607101" title="short permalink to this question" class="short-link" id="link-post-9607101">share</a><span class="lsep">|</span><a href="/posts/9607101/edit" class="suggest-edit-post" title="">improve this question</a></div>
                                    </td>
                                    <td align="right" class="post-signature">
                                        <div class="user-info ">
                                            <div class="user-action-time">
                                                <a href="/posts/9607101/revisions" title="show all edits to this post">edited <span title="2012-03-07 18:32:29Z" class="relativetime">Mar 7 '12 at 18:32</span></a>
                                            </div>
                                            <div class="user-gravatar32">
                                            </div>
                                            <div class="user-details">
                                                <div class="-flair">
                                                </div>
                                            </div>
                                        </div>
                                    </td>
                                    <td class="post-signature owner">
                                        <div class="user-info ">
                                            <div class="user-action-time">
                                                asked <span title="2012-03-07 18:31:11Z" class="relativetime">Mar 7 '12 at 18:31</span>
                                            </div>
                                            <div class="user-gravatar32">
                                                <a href="/users/702826/martin-hunt">
                                                    <div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/a578c3eae229c86dbe46d4b1603e071b?s=32&amp;d=identicon&amp;r=PG" alt="" width="32" height="32"></div>
                                                </a>
                                            </div>
                                            <div class="user-details">
                                                <a href="/users/702826/martin-hunt">Martin Hunt</a>
                                                <div class="-flair">
                                                    <span class="reputation-score" title="reputation score " dir="ltr">313</span><span title="7 silver badges"><span class="badge2"></span><span class="badgecount">7</span></span><span title="20 bronze badges"><span class="badge3"></span><span class="badgecount">20</span></span>
                                                </div>
                                            </div>
                                        </div>
                                    </td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
            <tr>
                <td class="votecell"></td>
                <td>
                    <div id="comments-9607101" class="comments ">
                        <table>
                            <tbody data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true">
                                <tr id="comment-12187969" class="comment ">
                                    <td class="comment-actions">
                                        <table>
                                            <tbody>
                                                <tr>
                                                    <td class=" comment-score">
                                                        <span title="number of 'useful comment' votes received" class="cool">1</span>
                                                    </td>
                                                    <td>
                                                        &nbsp;
                                                    </td>
                                                </tr>
                                            </tbody>
                                        </table>
                                    </td>
                                    <td class="comment-text">
                                        <div style="display: block;" class="comment-body">
                                            <span class="comment-copy">So you need to remove all the div tags but not the content between the div. Am I right?</span>
                                            –&nbsp;<a href="/users/500725/siva-charan" title="14,075 reputation" class="comment-user">Siva Charan</a>
                                            <span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12187969_9607101"><span title="2012-03-07 18:34:11Z" class="relativetime-clean">Mar 7 '12 at 18:34</span></a></span>
                                        </div>
                                    </td>
                                </tr>
                                <tr id="comment-12189778" class="comment ">
                                    <td>
                                        <table>
                                            <tbody>
                                                <tr>
                                                    <td class=" comment-score">
                                                        &nbsp;&nbsp;
                                                    </td>
                                                    <td>
                                                        &nbsp;
                                                    </td>
                                                </tr>
                                            </tbody>
                                        </table>
                                    </td>
                                    <td class="comment-text">
                                        <div style="display: block;" class="comment-body">
                                            <span class="comment-copy"><a href="http://stackoverflow.com/a/4667535/208809">Replace the XPath with <code>//div[not[@*]]</code></a> to remove all div elements (incl. content) without attributes.</span>
                                            –&nbsp;<a href="/users/208809/gordon" title="225,421 reputation" class="comment-user">Gordon</a>
                                            <span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12189778_9607101"><span title="2012-03-07 19:58:21Z" class="relativetime-clean">Mar 7 '12 at 19:58</span></a></span>
                                            <span class="edited-yes" title="this comment was edited 2 times"></span>
                                        </div>
                                    </td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                    <div id="comments-link-9607101" data-rep="50" data-anon="true">
                        <a class="js-add-link comments-link disabled-link " title="Use comments to ask for more information or suggest improvements. Avoid answering questions in comments.">add a comment</a><span class="js-link-separator dno">&nbsp;|&nbsp;</span>
                        <a class="js-show-link comments-link dno" title="expand to show all comments on this post" href="#" onclick=""></a>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
    HTML;

    echo strip_divs($html);

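Answer 2: (score: -1)

No. You do NOT parse or manipulate HTML with regular expressions.

Regular expressions can't be bargained with. They can't be reasoned with. They don't understand HTML, they don't grok XML. And they absolutely will not stop until your DOM tree is dead.

You use htmlpurifier and/or DOM to manipulate the tree.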
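
For the DOM route, something along these lines might work. It is a rough sketch using PHP's built-in `DOMDocument`, not code from this answer: the function name `strip_div_tags_dom` is hypothetical, the input is assumed to be an HTML fragment rather than a full page, and character-encoding handling for non-ASCII text is left out.

```php
<?php
// Hypothetical helper; a sketch of the DOM approach under the assumptions above.
function strip_div_tags_dom(string $html): string
{
    $doc = new DOMDocument();

    // Malformed pasted markup triggers libxml warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in <body> so it can be read back from a known node.
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html; // nothing was parsed, so return the input untouched
    }

    // Copy the live node list first, because it changes as nodes are removed.
    $divs = iterator_to_array($doc->getElementsByTagName('div'));
    foreach ($divs as $div) {
        // Move every child out in front of the div, then drop the empty div.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo strip_div_tags_dom('<div class="a"><p>Hi</p><div>there</div></div>'), "\n";
```

Because the parser repairs the markup while loading it, unclosed div tags are closed implicitly before they are unwrapped, which is the case the question is about.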

Answer 3: (score: -1)