Question

Regex

\<div class=g\>.*?\<a href=\"?(http:\/\/stackoverflow.com\/)\"?.*?\>.*?\<a href=\"?(.+?)\"?.*?\>.*?\<\/div\>

Target

<div class=g>
  <link rel=prefetch href="https://stackoverflow.com/">
  <h2 class=r>
    <a href="https://stackoverflow.com/" class=l onmousedown="return rwt(this,'','','dres','1','AFQjCNERidL9Hb6OvGW93_Y6MRj3aTdMVA','&amp;sig2=ybSqh-7yEKCGx_2MNIb7tA')">
      <em>Stack Overflow</em>
    </a>
  </h2>
  <table border=0 cellpadding=0 cellspacing=0>
    <tr>
      <td class=j>
        <font size=-1>
          <span class=f>Categoria: </span>
          <a href="/Top/Computers/Programming/Resources/Chats_and_Forums/?il=1">Computers&nbsp;&gt;&nbsp;Programming&nbsp;&gt;&nbsp;Resources&nbsp;&gt;&nbsp;Chats&nbsp;and&nbsp;Forums</a>
          <br>A language-independent collaboratively edited question and answer site for programmers. Questions and answers displayed by user votes and tags.<br>
          <span class=a><b>stackoverflow</b>.com/</span>
        </font>
      </td>
    </tr>
  </table>
</div>

It should match the whole block, https://stackoverflow.com/ and /Top/Computers/Programming/Resources/Chats_and_Forums/?il=1, but instead it matches the whole block, https://stackoverflow.com/ and /

Why?

Answer 1

That's because the regex in your second group matches reluctantly (a.k.a. an ungreedy or lazy match). For more on this, see http://www.regular-expressions.info/repeat.html, especially the paragraph "Laziness Instead of Greediness".

That's why it doesn't work the way you expect.
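
The effect can be reproduced on a single tag. The following is a minimal Python sketch, not the asker's code: the pattern is only the second-anchor fragment of the question's regex, and the shortened href is an assumption made for readability.

```python
import re

# Shortened stand-in for the category link in the target (hypothetical href).
tag = '<a href="/Top/Computers/?il=1">Computers</a>'

# Same shape as the question's second anchor: lazy group, optional quote, lazy filler.
lazy = re.search(r'<a href="?(.+?)"?.*?>', tag)

captured = lazy.group(1)   # the lazy group stops after one character
whole = lazy.group(0)      # the filler .*? absorbs the rest of the URL

print(captured)  # /
print(whole)     # <a href="/Top/Computers/?il=1">
```

The group is satisfied with a single character because everything after it in the pattern is optional or can stretch, so the engine never has to extend the capture.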

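Now, about fixing your problem: use a proper parser for this, or an existing tool for getting attributes out of HTML (jQuery does this nicely, I hear). Don't try to do it with a regex: you can make it work for this case, but next week you'll be back here because something else has broken.

Good luck!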

Answer 2

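I'm definitely not one of those "omg, you said HTML and regex in the same sentence, you must die" types, but this is clearly a case where regex is not the best tool for the job. (It isn't even a good tool, or an efficient one.)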

Parse it with an XML/HTML parser and save your co-workers a lot of trouble and abuse.
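
As one possible illustration of the parser route, here is a small sketch using Python's standard-library html.parser. The class name HrefCollector, the depth counter, and the trimmed copy of the target markup are all assumptions for this example, not something the answers prescribe.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag found inside <div class=g>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside a <div class=g> block
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the result block, or track a nested <div> inside it.
            if self.depth or attrs.get("class") == "g":
                self.depth += 1
        elif tag == "a" and self.depth and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


# Trimmed copy of the target; the <link> tag is ignored because it is not an <a>.
html = '''<div class=g>
  <link rel=prefetch href="https://stackoverflow.com/">
  <h2 class=r><a href="https://stackoverflow.com/" class=l><em>Stack Overflow</em></a></h2>
  <a href="/Top/Computers/Programming/Resources/Chats_and_Forums/?il=1">Computers</a>
</div>'''

parser = HrefCollector()
parser.feed(html)
print(parser.hrefs)
```

Because the parser works on tags and attributes rather than raw characters, quoted and unquoted attribute values are handled the same way, and extra attributes such as onmousedown do not affect the result.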

Answer 3

The problem is this...

(.+?)

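Why put the question mark there? With it you only get '/' in the search, because a ? after a quantifier makes it lazy: the group takes as few characters as possible, and the rest of the pattern absorbs the remainder of the URL. If you replace it with the following...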

([^"]+)

which matches a run of characters that are not double quotes, you should get everything: the stackoverflow href and the other href you mentioned.
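
As a rough check of this suggestion, here is a Python sketch run against a trimmed copy of the target. The trimmed markup, the variable names, and the https? tweak (the target uses https while the question's regex says http) are assumptions for illustration. re.DOTALL is used because the block spans several lines.

```python
import re

# Trimmed copy of the target block (assumption: only the two anchors matter).
html = '''<div class=g>
  <h2 class=r>
    <a href="https://stackoverflow.com/" class=l><em>Stack Overflow</em></a>
  </h2>
  <span class=f>Categoria: </span>
  <a href="/Top/Computers/Programming/Resources/Chats_and_Forums/?il=1">Computers</a>
</div>'''

# The question's pattern with the second group changed from (.+?) to ([^"]+).
pattern = re.compile(
    r'<div class=g>.*?<a href="?(https?://stackoverflow\.com/)"?.*?>'
    r'.*?<a href="?([^"]+)"?.*?>.*?</div>',
    re.DOTALL,  # let . match newlines
)

m = pattern.search(html)
site, category = m.groups()
print(site)      # https://stackoverflow.com/
print(category)  # /Top/Computers/Programming/Resources/Chats_and_Forums/?il=1
```

The negated class stops at the closing quote on its own, so the capture no longer depends on how the lazy quantifier interacts with the rest of the pattern. It does assume the href value is quoted.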

I'm not entirely sure why you're doing this. You may be using a regex where you don't need one. What is the purpose of this regex? It looks like overkill.

The regexp should match the site's category URL, but it matches /
