Question

I have this HTML:

"This is simple html text <script language="javascript">simple simple text text</script> text"

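I need to match words only outside the script tag. That is, if I want to match "simple" and "text", I should get results only from "This is simple html text" and the final " text": 1 match for "simple" and 2 matches for "text". Can anyone help me with this? I am using PHP.

I found a similar answer for matching text outside of tags: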

(text|simple)(?![^<]*>|[^<>]*</)

Regex replace text outside html tags

But I could not get it to work for one specific tag (script):

(text|simple)(?!(^<script*>)|[^<>]*</)

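PS: This question is not a duplicate (strip_tags, remove javascript), because I am not trying to strip tags or select the content inside the script tag. I am trying to replace content outside the "script" tag.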

Answer 1

My pattern uses (*SKIP)(*FAIL) to disqualify matched script tags and their contents.

text and simple will be matched at every qualifying occurrence.

Regex: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~

Pattern / Replacement Demo Link

Code: (Demo)

$strings=['This has no replacements',
    'This simple text has no script tag',
    'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',
    'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',
    '<script language="javascript">simple simple text text</script> this text starts with a script tag'
];

$strings=preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~','***replaced***',$strings);

var_export($strings);

Output:

array (
  0 => 'This has no replacements',
  1 => 'This ***replaced*** ***replaced*** has no script tag',
  2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',
  3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',
  4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',
)
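
The question also asks for match counts rather than replacements. A minimal sketch of that, assuming the same pattern is reused unchanged with preg_match_all and the matches are tallied with array_count_values (the variable names are only illustrative):

```php
<?php
// Sample input taken from the question.
$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

// Same pattern as in the answer: script elements are skipped, the words are matched.
$pattern = '~<script.*?/script>(*SKIP)(*FAIL)|text|simple~';

// $matches[0] receives every full match found outside the script element.
preg_match_all($pattern, $html, $matches);

// Tally how often each word was matched: simple => 1, text => 2.
$counts = array_count_values($matches[0]);

print_r($counts);
```

The words inside the script element never reach $matches, because the first alternative consumes the whole element and then fails on purpose.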

Answer 2

If you are sure that script will be present, simply match with

(.*?)<script.*</script>(.*)

The text outside the tag will be in submatches 1 and 2. If script is optional, use (.*?)(<script.*</script>)?(.*).
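
As a rough illustration of how the two submatches could be used from PHP, here is a sketch that applies the first pattern with preg_match and then counts the target words in the captured text. The ~ delimiters and the variable names are assumptions added for the example:

```php
<?php
// Sample input taken from the question.
$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

$before = '';
$after = '';

// The answer's pattern, wrapped in ~ delimiters so "/" needs no escaping.
if (preg_match('~(.*?)<script.*</script>(.*)~', $html, $m)) {
    $before = $m[1]; // text in front of the script element
    $after = $m[2];  // text behind the script element
}

// Count the target words in the two outside parts only.
$total = preg_match_all('~text|simple~', $before . ' ' . $after);

var_export([$before, $after, $total]);
```

Note that without the s modifier the dot does not match newlines, so this sketch assumes the markup sits on a single line.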

Answer 3

Here is another solution

([\w\s]*)(?:<script.*?\/script>)(.*)$

Here is a demo on https://regex101.com/r/1Lthi8/1
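
A small sketch of what this pattern captures, assuming it is run through preg_match with ~ delimiters. The second input is a made-up example, included to show that the first group only accepts word characters and whitespace, so leading text containing punctuation is cut off:

```php
<?php
$pattern = '~([\w\s]*)(?:<script.*?\/script>)(.*)$~';

// Sample input taken from the question.
$plain = 'This is simple html text <script language="javascript">simple simple text text</script> text';
preg_match($pattern, $plain, $m1);
// $m1[1] is 'This is simple html text ' and $m1[2] is ' text'.
var_export([$m1[1], $m1[2]]);

// Hypothetical input with a comma in front of the script element.
$punctuated = 'Hello, simple text <script>simple text</script> text';
preg_match($pattern, $punctuated, $m2);
// The comma is not in [\w\s], so $m2[1] is only ' simple text '.
var_export([$m2[1], $m2[2]]);
```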

Answer 4

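As far as tags go, it is not possible to ignore just a single tag without parsing all of the tags.

You can SKIP / FAIL past html tags and invisible content. This will find the words you are looking for.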

https://regex101.com/r/7ZGlvW/1

Formatted

    <
    (?:
         (?:
              (?:
                                                 # Invisible content; end tag req'd
                   (                             # (1 start)
                        script
                     |  style
                     |  object
                     |  embed
                     |  applet
                     |  noframes
                     |  noscript
                     |  noembed 
                   )                             # (1 end)
                   (?:
                        \s+ 
                        (?>
                             " [\S\s]*? "
                          |  ' [\S\s]*? '
                          |  (?:
                                  (?! /> )
                                  [^>] 
                             )?
                        )+
                   )?
                   \s* >
              )

              [\S\s]*? </ \1 \s* 
              (?= > )
         )

      |  (?: /? [\w:]+ \s* /? )
      |  (?:
              [\w:]+ 
              \s+ 
              (?:
                   " [\S\s]*? " 
                |  ' [\S\s]*? ' 
                |  [^>]? 
              )+
              \s* /?
         )
      |  \? [\S\s]*? \?
      |  (?:
              !
              (?:
                   (?: DOCTYPE [\S\s]*? )
                |  (?: \[CDATA\[ [\S\s]*? \]\] )
                |  (?: -- [\S\s]*? -- )
                |  (?: ATTLIST [\S\s]*? )
                |  (?: ENTITY [\S\s]*? )
                |  (?: ELEMENT [\S\s]*? )
              )
         )
    )
    >
    (*SKIP)
    (?!)
 |  
    (?: text | simple )

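Or, a faster way is to match both the tags AND the text you are looking for.

Matching the tags moves past them.

If you are doing a replacement, use a callback to decide what to replace. Group 1 is the TAG or invisible-content run; group 3 is the word you want to replace.

So, in the callback, if group 1 matched, just return group 1; if group 3 matched, return the replacement you want.

Regex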
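
The callback idea described above could look roughly like this in PHP. This is only a sketch: the pattern is a simplified stand-in that recognizes script elements alone (not the full tag regex of this answer), so the word lands in group 2 here rather than group 3, and the replacement string is borrowed from Answer 1:

```php
<?php
// Sample input taken from the question.
$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

// Simplified stand-in pattern (an assumption, not the answer's regex):
// group 1 = a whole script element, group 2 = a word to replace.
$pattern = '~(<script\b.*?</script>)|(text|simple)~s';

$result = preg_replace_callback($pattern, function (array $m) {
    // Group 1 matched: a script element, so hand it back untouched.
    if ($m[1] !== '') {
        return $m[1];
    }
    // Otherwise group 2 matched a target word: substitute it.
    return '***replaced***';
}, $html);

echo $result, PHP_EOL;
```

Because the script element is consumed as one match and returned unchanged, the words inside it are never offered to the second alternative.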

https://regex101.com/r/7ZGlvW/2

This regex is comparable to how SAX and DOM parsers parse tags. I have posted it hundreds of times on SO.

Here is an example of how to remove all html tags:

https://regex101.com/r/oCVkZv/1

Regex replace text outside script tag
