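I have a WYSIWYG editor on a site. The problem is that users copy and paste a lot of data into it, leaving many unclosed and improperly formatted div tags that break the site layout.
Is there an easy way to strip all occurrences of <div> and </div>?
str_replace won't work because some of the divs carry styling and other attributes, so it would need to account for <div style="some styling">, <div align="center">, etc.
I'm guessing this could be done with a regular expression, but I am a total beginner when it comes to those.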
Answer 0 (score: 6)
It is better to use a DOM-based HTML parser, but if you have no choice but to use a regex, you can do it like this:
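$patterns = array();
// Opening tags, with or without attributes (e.g. <div style="...">).
// \b keeps tag names such as <divider> from matching; the i flag also catches <DIV>.
$patterns[0] = '/<div\b[^>]*>/i';
// Closing tags.
$patterns[1] = '/<\/div>/i';
$replacements = array();
$replacements[0] = '';
$replacements[1] = '';
echo preg_replace($patterns, $replacements, $html);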
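The two patterns can also be folded into a single one. The following is only a minimal sketch: the function name `remove_div_tags` is hypothetical (it is not part of the answer), and it assumes that no attribute value contains a literal `>`, which a regex of this kind cannot handle.

```php
<?php
/**
 * Hypothetical helper (not from the original answer): strips opening and
 * closing div tags while keeping the content between them.
 */
function remove_div_tags(string $html): string
{
    // <\/?   optional slash, so one pattern covers <div ...> and </div>
    // div\b  word boundary, so a tag name such as <divider> is not matched
    // [^>]*  any attributes up to the closing bracket
    // i      case-insensitive, so <DIV> is matched too
    $result = preg_replace('/<\/?div\b[^>]*>/i', '', $html);

    // preg_replace() returns null on a regex engine error; fall back to the input.
    return $result ?? $html;
}

$sample = '<div style="some styling"><DIV align="center">text</div> more';
echo remove_div_tags($sample), PHP_EOL;
```

Unclosed or unmatched tags are not a problem here, because each tag is removed on its own and nothing is paired up.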
Answer 1 (score: 0)
Here is a simple example of how to do it with PHP:
<?php
/**
* Removes the divs because why not
*/
function strip_divs(&$text, $id = 'html') {
$replacements = array();
worker($text, $replacements, $id);
foreach ($replacements as $key => $val) {
$text = mb_str_replace($key, $val, $text);
}
return $text;
}
function worker(&$body, &$replacements, $id) {
static $call_count;
if (empty($call_count)) {
$call_count = array();
}
if (empty($call_count[$id])) {
$call_count[$id] = 0;
}
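// Compare against FALSE explicitly: mb_strpos() returns 0 (falsy) when the match is at the very start.
if (mb_strpos($body, '</div>') !== FALSE) {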
$body = mb_str_replace('</div>', '', $body);
}
if (mb_strpos($body, '<di') !== FALSE) {
$call_count[$id] ++;
// Gets the important junk
$rm = '<di' . xml_get($body, '<di', '>') . '>';
// Builds the replacements HTML
$replacement_html = '';
$next_id = count($replacements);
$replacement_id = "[[div-$next_id]]";
$replacements[$replacement_id] = $replacement_html;
$body = mb_str_replace($rm, $replacement_id, $body);
if (mb_strpos($body, '<di') !== FALSE && $call_count[$id] < 200) {
worker($body, $replacements, $id);
}
}
}
/**
* Returns text by specifying a start and end point
*
* @param str $str
* The text to search
* @param str $start
* The beginning identifier
* @param str $end
* The ending identifier
*/
function xml_get($str, $start, $end) {
$str = "|" . $str . "|";
$len = mb_strlen($start);
if (mb_strpos($str, $start) > 0) {
$int_start = mb_strpos($str, $start) + $len;
$temp = right($str, (mb_strlen($str) - $int_start));
$int_end = mb_strpos($temp, $end);
$return = trim(left($temp, $int_end));
return $return;
}
else {
return FALSE;
}
}
function right($str, $count) {
return mb_substr($str, ($count * -1));
}
function left($str, $count) {
return mb_substr($str, 0, $count);
}
/**
* Multibyte str replace
*/
if (!function_exists('mb_str_replace')) {
function mb_str_replace($search, $replace, $subject, &$count = 0) {
if (!is_array($subject)) {
$searches = is_array($search) ? array_values($search) : array($search);
$replacements = is_array($replace) ? array_values($replace) : array($replace);
$replacements = array_pad($replacements, count($searches), '');
foreach ($searches as $key => $search) {
$parts = mb_split(preg_quote($search), $subject);
$count += count($parts) - 1;
$subject = implode($replacements[$key], $parts);
}
}
else {
foreach ($subject as $key => $value) {
$subject[$key] = mb_str_replace($search, $replace, $value, $count);
}
}
return $subject;
}
}
$html = <<<HTML
<table>
<tbody>
<tr>
<td class="votecell">
<div class="vote">
<input type="hidden" name="_id_" value="9607101">
<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>
<span itemprop="upvoteCount" class="vote-count-post ">0</span>
<a class="vote-down-off" title="This question does not show any research effort; it is unclear or not useful">down vote</a>
<a class="star-off" href="#">favorite</a>
<div class="favoritecount"><b></b></div>
</div>
</td>
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>I have a wysiwyg on a site. The problem is that the users are copy pasting a lot of data in to it leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout. </p>
<p>Is there an easy an easy way to strip all occurrences of <code><div></code> and <code></div></code>?</p>
<p>str_replace won't work because some of the divs have styling and other things in them so it would need to account for <code><div style="some styling"> <div align="center"></code> etc</p>
<p>I'm guessing this could be done with a regular expression but I am total a total beginner when it comes to those. </p>
<p>Thanks a lot,
Martin
</p>
</div>
<div class="post-taglist">
<a href="/questions/tagged/php" class="post-tag js-gps-track" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/regex" class="post-tag js-gps-track" title="show questions tagged 'regex'" rel="tag">regex</a> <a href="/questions/tagged/replace" class="post-tag js-gps-track" title="show questions tagged 'replace'" rel="tag">replace</a> <a href="/questions/tagged/str-replace" class="post-tag js-gps-track" title="" rel="tag">str-replace</a> <a href="/questions/tagged/strip-tags" class="post-tag js-gps-track" title="show questions tagged 'strip-tags'" rel="tag">strip-tags</a>
</div>
<table class="fw">
<tbody>
<tr>
<td class="vt">
<div class="post-menu"><a href="/q/9607101" title="short permalink to this question" class="short-link" id="link-post-9607101">share</a><span class="lsep">|</span><a href="/posts/9607101/edit" class="suggest-edit-post" title="">improve this question</a></div>
</td>
<td align="right" class="post-signature">
<div class="user-info ">
<div class="user-action-time">
<a href="/posts/9607101/revisions" title="show all edits to this post">edited <span title="2012-03-07 18:32:29Z" class="relativetime">Mar 7 '12 at 18:32</span></a>
</div>
<div class="user-gravatar32">
</div>
<div class="user-details">
<div class="-flair">
</div>
</div>
</div>
</td>
<td class="post-signature owner">
<div class="user-info ">
<div class="user-action-time">
asked <span title="2012-03-07 18:31:11Z" class="relativetime">Mar 7 '12 at 18:31</span>
</div>
<div class="user-gravatar32">
<a href="/users/702826/martin-hunt">
<div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/a578c3eae229c86dbe46d4b1603e071b?s=32&d=identicon&r=PG" alt="" width="32" height="32"></div>
</a>
</div>
<div class="user-details">
<a href="/users/702826/martin-hunt">Martin Hunt</a>
<div class="-flair">
<span class="reputation-score" title="reputation score " dir="ltr">313</span><span title="7 silver badges"><span class="badge2"></span><span class="badgecount">7</span></span><span title="20 bronze badges"><span class="badge3"></span><span class="badgecount">20</span></span>
</div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
<tr>
<td class="votecell"></td>
<td>
<div id="comments-9607101" class="comments ">
<table>
<tbody data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true">
<tr id="comment-12187969" class="comment ">
<td class="comment-actions">
<table>
<tbody>
<tr>
<td class=" comment-score">
<span title="number of 'useful comment' votes received" class="cool">1</span>
</td>
<td>
</td>
</tr>
</tbody>
</table>
</td>
<td class="comment-text">
<div style="display: block;" class="comment-body">
<span class="comment-copy">So you need to remove all the div tags but not the content between the div. Am I right?</span>
– <a href="/users/500725/siva-charan" title="14,075 reputation" class="comment-user">Siva Charan</a>
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12187969_9607101"><span title="2012-03-07 18:34:11Z" class="relativetime-clean">Mar 7 '12 at 18:34</span></a></span>
</div>
</td>
</tr>
<tr id="comment-12189778" class="comment ">
<td>
<table>
<tbody>
<tr>
<td class=" comment-score">
</td>
<td>
</td>
</tr>
</tbody>
</table>
</td>
<td class="comment-text">
<div style="display: block;" class="comment-body">
<span class="comment-copy"><a href="http://stackoverflow.com/a/4667535/208809">Replace the XPath with <code>//div[not[@*]]</code></a> to remove all div elements (incl. content) without attributes.</span>
– <a href="/users/208809/gordon" title="225,421 reputation" class="comment-user">Gordon</a>
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12189778_9607101"><span title="2012-03-07 19:58:21Z" class="relativetime-clean">Mar 7 '12 at 19:58</span></a></span>
<span class="edited-yes" title="this comment was edited 2 times"></span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<div id="comments-link-9607101" data-rep="50" data-anon="true">
<a class="js-add-link comments-link disabled-link " title="Use comments to ask for more information or suggest improvements. Avoid answering questions in comments.">add a comment</a><span class="js-link-separator dno"> | </span>
<a class="js-show-link comments-link dno" title="expand to show all comments on this post" href="#" onclick=""></a>
</div>
</td>
</tr>
</tbody>
</table>
HTML;
echo strip_divs($html);
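Answer 2 (score: -1)
No. You do NOT use regular expressions to parse or manipulate HTML.
Regular expressions can't be bargained with. They can't be reasoned with. They don't understand HTML, they don't grok XML. And they absolutely will not stop, ever, until your DOM tree is dead.
You use htmlpurifier and/or DOM to manipulate the tree.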
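As a rough illustration of the DOM route, here is a minimal sketch that unwraps every div and keeps its children. The function name `unwrap_divs` is hypothetical, and the sketch assumes PHP's DOM extension is installed and that the input is a UTF-8 fragment starting with an element. The parser may wrap bare leading text in a paragraph, so the output is not guaranteed to be identical to the input minus the divs.

```php
<?php
/**
 * Hypothetical helper (not from the original answers): removes every <div>
 * element but keeps its children in place, using PHP's DOM extension.
 */
function unwrap_divs(string $html): string
{
    $doc = new DOMDocument();

    // Pasted markup is often malformed; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The leading XML declaration is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so keep taking the first
    // remaining div until none are left.
    $divs = $doc->getElementsByTagName('div');
    while (($div = $divs->item(0)) !== null) {
        // Move each child in front of the div, then drop the now-empty div.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialize only the children of <body>, which the parser adds around fragments.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $output = '';
    foreach ($body->childNodes as $child) {
        $output .= $doc->saveHTML($child);
    }
    return $output;
}

// Both divs are left unclosed on purpose, as in the pasted content described in the question.
echo unwrap_divs('<div style="color:red"><p>Hello</p><div align="center"><b>World</b>');
```

Because the parser closes dangling tags while building the tree, unclosed divs are handled before any removal happens. The serializer may insert line breaks between block elements, so exact whitespace can differ from the input.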
Answer 3 (score: -1)
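// Caution: the second argument of strip_tags() lists the tags to keep, so this
// call strips every tag except <div>, which is the opposite of what was asked.
strip_tags($str, '<div>');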
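The behaviour of the second argument is easy to check with a small sketch. The sample string and the list of tags to keep (`<p><b>`) are assumptions made for illustration. With real WYSIWYG content, every tag that should survive would have to be listed.

```php
<?php
$sample = '<div class="box"><p>Hello <b>world</b></p></div>';

// The second argument is an allow-list: only <div> survives here.
$onlyDivsKept = strip_tags($sample, '<div>');
echo $onlyDivsKept, PHP_EOL;

// To drop the divs instead, the tags to keep have to be listed explicitly.
$divsRemoved = strip_tags($sample, '<p><b>');
echo $divsRemoved, PHP_EOL;
```

Since strip_tags() has no "remove only these tags" mode, it fits this problem only when the full set of permitted tags is known in advance.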