使用Javascript如何从带有多个标签的html的reg表达式中删除标记

时间:2016-05-17 22:26:46

标签: javascript html regex

我有一个html字符串,我希望在没有<img />的情况下进行部署。目前我所拥有的是:

var myHTML = "<p><img class="alignnone size-full wp-image-2857" 
src="https://files.wordpress.com/2016/05/laptop.jpg?w=750&#038;h=545" 
alt="https://pixabay.com/en/laptop-printer-office-folder-graph-1016257/" 
width="750" height="545" /></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE 
STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER 
STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p>
<em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-
decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha 
POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p> 
DONE</p> "

我认为应该是这样的:

var myHTML2 = "<p></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE 
    STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER 
    STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p>
    <em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-
    decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha 
    POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p> 
    DONE</p> "

我尝试了什么:

myHTML.replace(/<(?!\s*\/?\s*p\b)[^>]*>/gi,'') 

但是这会删除字符串中的所有html,我只想删除<img />标记。

3 个答案:

答案 0 :(得分:1)

转发

由于可能出现的所有可能模糊的边缘情况,使用正则表达式解析HTML是不明智的,但似乎您可以控制HTML,因此您应该能够避免使用许多边缘案件正义报警呐喊。

描述

<img\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>

Regular expression visualization

替换为: 没有

此正则表达式将执行以下操作:

  • 匹配整个img标记以包含任何子属性
  • 避免困难的边缘情况,使处理hmtl困难

实施例

直播演示 https://regex101.com/r/pG1oI7/1

示例字符串

<p><img class="alignnone size-full wp-image-2857" 
src="https://files.wordpress.com/2016/05/laptop.jpg?w=750&#038;h=545" 
alt="https://pixabay.com/en/laptop-printer-office-folder-graph-1016257/" 
width="750" height="545" /></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE 
STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER 
STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p>
<em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-
decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha 
POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p> 
DONE</p> 

替换后

<p></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE 
STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER 
STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p>
<em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-
decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha 
POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p> 
DONE</p> 

解释

NODE                     EXPLANATION
----------------------------------------------------------------------
  <img                     '<img'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  >                        '>'

答案 1 :(得分:1)

您可以使用此正则表达式删除img标记:

<img[^>]+>

老老实实,我不知道你要用正则表达式做什么。它并不需要复杂,唯一的正则表达式构造&#34;我必须使用的是[^>]+,它只匹配不是>的字符。

使用简单正则表达式的好处是可读性和速度。当然,如果你想考虑边缘情况(例如嵌入式JS中的误报),你应该使用HTML解析器。

答案 2 :(得分:1)

这不是正则表达式的答案,但如果您已经在使用javascript,则可以使用javascript的设计并直接操作DOM

require 'prawn'

Prawn::Document.generate("my_text.pdf") do

  font('Helvetica', size: 50) do
    formatted_text_box(
      [{text: 'Whatever text you need to print'}],
      at: [0, bounds.top],
      width: 100,
      height: 50,
      overflow: :shrink_to_fit,
      disable_wrap_by_char: true  # <---- newly added in 1.2
    )
  end
end

如果要获得大量图像,则需要使用类或ID。