Replace enclosing backticks with HTML tags, but not inside <code> blocks

Date: 2018-06-04 16:36:44

Tags: php regex

Goal: modify an HTML string so that text enclosed in backticks is wrapped in an inline code tag (as Stack Overflow does). At the same time, existing <code> blocks may also contain backticks, and those must stay unchanged.

Example:

<p>This is my `inline code`, it can be replaced and tag-wrapped.</p>
<p><code>This text contains `apostrophs`, but should `not` be changed.</code></p>

This is the regex I am using to convert all enclosing backticks to <code> elements:

// wrap each pair of backticks in a <code> tag
$content = preg_replace('/(.+?)\`(.+?)\`/', '$1<code class="inlinecode">$2</code>', $content);


Required:
Change the regex so that it does not convert backticks that are within a <code> block.



Disclaimer: I spent several hours trying to load the HTML string with PHP's DOM parser, extract all nodes of type code, change their content, and write them back, only to find that nodeValue strips all HTML tags (especially the line breaks). I then tried several solutions found online, without success. Now I am falling back to regex, even against the odds.

For reference, this is how I tried it the DOM way:

$code_blocks = $dom->getElementsByTagName('code');
foreach ($code_blocks as $codenode) {
    // nodeValue strips HTML tags, so serialize the node itself instead
    $nodevalue_html = $codenode->ownerDocument->saveXML($codenode);
    // store each backtick as '~~~APO~~~' so that it survives the later replacement
    $nodevalue_html = preg_replace('/`/', '~~~APO~~~', $nodevalue_html);
    // $codenode->textValue = $nodevalue_html; // fail
    // $codenode->nodeValue = $nodevalue_html; // fail
    // ...
}
// html to string
$html_new = $dom->saveHTML();
$html_new = preg_replace('/~~~APO~~~/', '`', $html_new);

I wish I could use Markdown like Stack Overflow does, but I still need to deal with HTML.

2 answers:

Answer 0 (score: 1)

Use an XPath query that skips text nodes with a code element as an ancestor:

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::code)][contains(.,"`")]');
foreach ($textNodes as $textNode) {
    $parts = (function($text) { yield from explode('`', $text); })($textNode->nodeValue);
    $frag = $dom->createDocumentFragment();
    do {
        $frag->appendChild($dom->createTextNode($parts->current()));
        $parts->next();
        if ( $parts->valid() ) {
            $codeElt = $dom->createElement('code');
            $codeElt->appendChild($dom->createTextNode($parts->current()));
            $frag->appendChild($codeElt);
            $parts->next();
        }
    } while ($parts->valid());
    $textNode->parentNode->replaceChild($frag, $textNode);
}
echo $dom->saveHTML();

Note: this code relies on "yield from" and an immediately invoked closure, so it requires PHP 7.0 or later.
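The same idea can be sketched without a generator, which also makes it easy to add the class="inlinecode" attribute from the question. This is only an illustration under a few assumptions: the function name wrapInlineCode is hypothetical, and the temporary <div> wrapper is added here so that the parser sees a single root element when LIBXML_HTML_NOIMPLIED is set. Character encoding is not handled, so treat it as a starting point for plain ASCII input.

```php
<?php
// Hypothetical helper, not part of the answer above.
function wrapInlineCode($html)
{
    $dom = new DOMDocument;
    // Assumption: a temporary <div> gives the parser a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    $xp = new DOMXPath($dom);

    // Same query as in the answer: text nodes with a backtick, outside <code>.
    $textNodes = $xp->query('//text()[not(ancestor::code)][contains(., "`")]');
    foreach ($textNodes as $textNode) {
        $parts = explode('`', $textNode->nodeValue);
        $frag = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 0) {
                // Even pieces are ordinary text.
                $frag->appendChild($dom->createTextNode($part));
            } else {
                // Odd pieces sat between two backticks.
                $codeElt = $dom->createElement('code');
                $codeElt->setAttribute('class', 'inlinecode');
                $codeElt->appendChild($dom->createTextNode($part));
                $frag->appendChild($codeElt);
            }
        }
        $textNode->parentNode->replaceChild($frag, $textNode);
    }

    // Serialize only the wrapper's children so the temporary <div> is dropped.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Like the generator version, this treats every odd-numbered piece as code, so a text node with an unpaired backtick will have its tail wrapped as well.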

Answer 1 (score: 0)

I believe the only way is to explode and reassemble the string:

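$html_string = '....................'; // contains backticks and <code>...</code> blocks

$delim = "<code>";
$closing_tag = "</code>";
$explode = explode($delim, $html_string);

foreach ($explode as &$ex) {
    $closing_tag_pos = strpos($ex, $closing_tag);
    if ($closing_tag_pos !== false) {
        // everything before the closing tag is the content of a <code> block
        $pre_closing_tag = substr($ex, 0, $closing_tag_pos);
        // the closing tag and whatever follows it lies outside the block
        $post_closing_tag = substr($ex, $closing_tag_pos);
        // mask backticks inside the block only, so a later replacement skips them
        $ex = str_replace('`', '~~~APO~~~', $pre_closing_tag) . $post_closing_tag;
    }
}
unset($ex); // drop the reference left over from the foreach

$mapped_html_string = implode($delim, $explode);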
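The snippet above stops at the masked string. The full round trip (mask the backticks inside <code> blocks, wrap the remaining pairs, then restore the masked ones) might look roughly like the sketch below. The function name wrapBackticksOutsideCode is hypothetical, and the sketch assumes that blocks open with a literal <code> tag without attributes and that the text never contains the placeholder ~~~APO~~~ on its own.

```php
<?php
// Hypothetical helper that completes the explode-and-reassemble idea.
function wrapBackticksOutsideCode($html)
{
    $open = '<code>';
    $close = '</code>';
    $placeholder = '~~~APO~~~';

    // 1. Mask backticks inside <code>...</code> blocks.
    $chunks = explode($open, $html);
    foreach ($chunks as $i => $chunk) {
        $pos = strpos($chunk, $close);
        // The first chunk precedes any <code> tag, so it is never inside a block.
        if ($i > 0 && $pos !== false) {
            $inside = substr($chunk, 0, $pos);
            $outside = substr($chunk, $pos);
            $chunks[$i] = str_replace('`', $placeholder, $inside) . $outside;
        }
    }
    $masked = implode($open, $chunks);

    // 2. Wrap the remaining backtick pairs in an inline code tag.
    $wrapped = preg_replace('/`([^`]+)`/', '<code class="inlinecode">$1</code>', $masked);

    // 3. Restore the backticks that were masked in step 1.
    return str_replace($placeholder, '`', $wrapped);
}

$html = '<p>This is my `inline code`, ok.</p>'
      . '<p><code>This text contains `apostrophs`, but should `not` be changed.</code></p>';
echo wrapBackticksOutsideCode($html);
```

Because the delimiter is a fixed string, a block written as <code class="..."> would not be recognized; matching the opening tag with a pattern would be one way to relax that assumption.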