Question

I use the following regular expression to detect opening and closing script tags in an HTML file:

<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

In short: <script NOT</s > NOT</s </script>

It works, but on long strings it takes a very long time to detect <script>, even minutes or hours.

The lite version works perfectly even on long strings:

<script[^<]*>[^<]*</script>

However, I also use the extended pattern for other tags such as <a>, where < and > can appear inside attribute values.

A Python test for you:

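import re
# Raw string keeps the pattern literal; flags are combined with |
pattern = re.compile(r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
# Short input: returns immediately
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
# Long input: triggers the extremely slow backtracking described above
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()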

How can I fix this? The inner part of the regex (after <script>) should be changed and simplified.

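PS :) Anticipating answers that regex is the wrong approach to HTML parsing: I know plenty of HTML/XML parsers very well, I know even better how often my HTML is broken, and regex is extremely useful here.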

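Comment: well, I need to handle:
every <a < document like this.border="5px;">
and the approach is to use parsers and regex together. BeautifulSoup is only 2k lines, does not handle every HTML document, and just extends the regexes from sgmllib.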

The main reason is that I must know exactly where each tag starts and stops, and every broken HTML document has to be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('<scritt\n\n&n;<aa>s</script>').findAll('script') == []

@Cylian: as you know, atomic grouping is not available in Python's re, so a non-greedy .*? that matches everything up to <\s*/\s*tag\s*> is the winner for now.

I know it is not perfect in a case like this: re.search('<\s*script.*?<\s*/\s*script\s*>', '<script</script>shit</script>').group() but I can handle the rejected tail on the next parsing pass.
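A minimal sketch of that non-greedy approach, assuming the goal is the start and end offset of each script block. The helper name `script_spans` and the sample strings are illustrative only:

```python
import re

# Whitespace-tolerant, non-greedy pattern from the comment above
SCRIPT_RE = re.compile(r'<\s*script.*?<\s*/\s*script\s*>', re.I | re.DOTALL)


def script_spans(html):
    """Return the (start, end) offsets of every script block found in html."""
    return [m.span() for m in SCRIPT_RE.finditer(html)]


html = '11<script type="text/javascript"> easy>example</script>22'
for start, end in script_spans(html):
    print(start, end, html[start:end])

# The imperfect case: the match stops at the first closing tag,
# so the remaining text is left over for a later pass.
broken = '<script</script>tail</script>'
start, end = script_spans(broken)[0]
print(repr(broken[end:]))
```

Because `finditer` resumes after each match, the leftover tail is simply skipped here; a second pass would have to decide what to do with it.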

It is pretty clear that parsing HTML with regex is not a fight that is won in a single battle.

Answer 1

Use an HTML parser like BeautifulSoup.

See the great answers to “Can I remove script tags with beautifulsoup”.

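If your only tool is a hammer, every problem starts to look like a nail. Regular expressions are a powerful hammer, but they are not always the best solution to a given problem.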

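I guess you want to remove scripts from user-posted HTML for security reasons. If security is the main concern, regular expressions are hard to get right, because there are many things an attacker can change to fool your regex that most browsers will still happily evaluate... A dedicated parser is easier to use, performs better, and is safer.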
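As a rough illustration of the parser route, here is a sketch that uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup, only so that the example has no third-party dependency. The class and function names are made up for this example, comments and doctype declarations are silently dropped, and it is not a complete sanitizer (event-handler attributes and `javascript:` URLs are left untouched):

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit tags and text while skipping every <script> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            # get_starttag_text() returns the start tag as written in the input
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <script/> is dropped
        if tag != "script" and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            # The parser decoded character references, so escape the text again
            self.out.append(escape(data, quote=False))


def strip_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(strip_scripts('11<script type="text/javascript"> easy>example</script>22'))
```

The parser decides where the script element ends, so a `>` inside the script body does not confuse it.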

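If you are still wondering "why can't I use regex", read this answer, which mayhewr pointed out in the comments. I could not put it any better; the guy nailed it, and his 4433 upvotes are well deserved.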

Answer 2

I don't know Python, but I do know regular expressions:

If you use the greedy/non-greedy operators, you get a much simpler regex:

<script.*?>.*?</script>

This assumes there are no nested scripts.
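For reference, a sketch of how this pattern could be tried in Python against the two test strings from the question. The flags are an assumption carried over from the question's own test; `re.DOTALL` is needed so that `.` also matches newlines inside a script body:

```python
import re

# Non-greedy pattern from this answer
pattern = re.compile(r'<script.*?>.*?</script>', re.I | re.DOTALL)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + 'hard example' * 50 + '</script>'

print(pattern.search(easy).group())
# The long input that stalled the original pattern matches in full
print(pattern.search(hard).group() == hard)
```

There is no quantifier nested inside another quantifier here, so the engine has far fewer ways to split the same text when it backtracks.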

Answer 3

The problem with your pattern is that it backtracks. Using atomic groups solves this. Change your pattern to:

<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>   
         ^^^^^                           ^^^^^

Explanation

<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
      Match any character that is NOT a “<” «[^<]+?»
         Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
      Match any character that is NOT a “<” «[^<]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->
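Python's `re` module only gained native `(?>...)` support in version 3.11. On older versions an atomic group can be emulated with a capturing lookahead followed by a backreference, `(?=(X))\1`, because a lookahead is never re-entered once it has matched. The following is a sketch of the pattern above rewritten that way; the variable names are illustrative:

```python
import re

# (?=(X))\N stands in for the atomic group (?>X): the lookahead captures one
# way of matching X and the backreference consumes exactly that text.
ATOMIC_SCRIPT_RE = re.compile(
    r'<script'
    r'(?:(?=([^<]+?|<(?:[^/]|/[^s])))\1)*'  # emulates (?>[^<]+?|<(?:[^/]|/[^s]))*
    r'>'
    r'(?=([^<]+|<(?:[^/]|/[^s])*))\2'       # emulates (?>[^<]+|<(?:[^/]|/[^s])*)
    r'</script>',
    re.I | re.DOTALL,
)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + 'hard example' * 50 + '</script>'

print(ATOMIC_SCRIPT_RE.search(easy).group())
print(ATOMIC_SCRIPT_RE.search(hard).group() == hard)
```

Each pass through the first group consumes a fixed piece of text that cannot be split differently later, so backing up to the `>` that closes the opening tag takes a number of steps proportional to the input length.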

Heavy regex - really time consuming
