Regex does not return the expected string when another one follows it

Time: 2017-03-17 17:25:02

Tags: regex

Given this regular expression:

(?:<.*>)?(?:.*)?("|quot;)(.*)(\1)(?:.*)?(?:<.*>)?(?:http(s)?:\/\/)?(?:w{3})?plainview.io\/archives\/(\w+)(?:.*)?(?:<.*>)?

I need to be able to select

"minister, a loyal party member with black rimmed glasses told us. He's the best man for the job."

I can do that with the following text:

<p>This is some text before, &quot;minister, a loyal party member with black rimmed glasses told us. He&#039;s the best man for the job.&quot; www.plainview.io/archives/SysteBvsl</a> and some text after</p>

But not with the following:

<p>This is some text before, &quot;minister, a loyal party member with black rimmed glasses told us. He&#039;s the best man for the job.&quot; <a href="https://l.facebook.com/l.php?u=http%3A%2F%2Fwww.plainview.io%2Farchives%2FSysteBvsl&amp;h=ATPBq9DrC_xIokWhmxk7f3nyKGofYnM9zGt3mF-7bfMNNupsX0WSR4TdE6VmX6W9gd_1Rnby1nXfIfq3MzgOS2PKryxKu9z3yci0ZvomiLHvYbVSfuwg29Y1Z_R1LEKRDXO3sAOZ2dsMgQ&amp;enc=AZMnRgfaZaV-J1wtvqulToF-RxOlkhgY6kzmkLuXSv26a0waxI3nHsI1rXkl-ILjrXkcnwajsVFizefc27K5A_WlqpJrNQLKWSTnDSwIGHGHYvWDp1CWeBP8vbzcQZcnJHA-ka3LvpJIYIO7_YwPaEpKsT0I0nNewd0aHZYbPtHghob7_7a_fubIkIy5g3R7ExA&amp;s=1" target="_blank" rel="nofollow" onmouseover="LinkshimAsyncLink.swap(this, &quot;http:\\/\\/www.plainview.io\\/archives\\/SysteBvsl&quot;);" onclick="LinkshimAsyncLink.referrer_log(this, &quot;http:\\/\\/www.plainview.io\\/archives\\/SysteBvsl&quot;, &quot;\\/si\\/ajax\\/l\\/render_linkshim_log\\/?u=http\\u00253A\\u00252F\\u00252Fwww.plainview.io\\u00252Farchives\\u00252FSysteBvsl&amp;h=ATPBq9DrC_xIokWhmxk7f3nyKGofYnM9zGt3mF-7bfMNNupsX0WSR4TdE6VmX6W9gd_1Rnby1nXfIfq3MzgOS2PKryxKu9z3yci0ZvomiLHvYbVSfuwg29Y1Z_R1LEKRDXO3sAOZ2dsMgQ&amp;enc=AZMnRgfaZaV-J1wtvqulToF-RxOlkhgY6kzmkLuXSv26a0waxI3nHsI1rXkl-ILjrXkcnwajsVFizefc27K5A_WlqpJrNQLKWSTnDSwIGHGHYvWDp1CWeBP8vbzcQZcnJHA-ka3LvpJIYIO7_YwPaEpKsT0I0nNewd0aHZYbPtHghob7_7a_fubIkIy5g3R7ExA&amp;d&quot;);">www.plainview.io/archives/SysteBvsl</a> and some text after</p>

Instead, for the latter, I get

\\/si\\/ajax\\/l\\/render_linkshim_log\\/?u=http\\u00253A\\u00252F\\u00252Fwww.plainview.io\\u00252Farchives\\u00252FSysteBvsl&amp;h=ATPBq9DrC_xIokWhmxk7f3nyKGofYnM9zGt3mF-7bfMNNupsX0WSR4TdE6VmX6W9gd_1Rnby1nXfIfq3MzgOS2PKryxKu9z3yci0ZvomiLHvYbVSfuwg29Y1Z_R1LEKRDXO3sAOZ2dsMgQ&amp;enc=AZMnRgfaZaV-J1wtvqulToF-RxOlkhgY6kzmkLuXSv26a0waxI3nHsI1rXkl-ILjrXkcnwajsVFizefc27K5A_WlqpJrNQLKWSTnDSwIGHGHYvWDp1CWeBP8vbzcQZcnJHA-ka3LvpJIYIO7_YwPaEpKsT0I0nNewd0aHZYbPtHghob7_7a_fubIkIy5g3R7ExA&amp;d&

Why does it select the later text when I add more text after the string I actually need?

1 answer:

Answer 0: (score: 1)

You should learn how regular expressions work internally.

The problem here is mainly an (overly) complex regex combined with greedy matching:

(?:<.*>)?(?:.*?)?("|quot;)(.*)(\1)(?:.*)?(?:<.*>)?(?:http(s)?:\/\/)?(?:w{3})?plainview.io\/archives\/(\w+)(?:.*)?(?:<.*>)?

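is a first step toward solving your problem. All I did here was replace the first (?:.*) with (?:.*?) (adding a ?), so that group 1 matches the earliest quote instead of one late in the line. Note that the capturing group (.*) is still greedy and can run on to the last quot; before the link; making it lazy as well, (.*?), stops the capture at the first closing quot;.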
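To make the effect of greediness concrete, here is a minimal Python sketch that runs three variants of the pattern against a shortened input. The `SAMPLE` string is a hypothetical stand-in for the second HTML snippet in the question, the helper `quoted_part` is illustrative only, and the third pattern (with the capturing group made lazy too) is an assumption that does not appear in the answer.

```python
import re

# Shortened, hypothetical stand-in for the second HTML sample in the question:
# one quoted sentence, then more quot; entities inside the link's attributes.
SAMPLE = (
    '<p>before, &quot;the quote.&quot; '
    '<a href="#" onclick="f(&quot;a&quot;, &quot;b&quot;);">'
    'www.plainview.io/archives/SysteBvsl</a> after</p>'
)

# Tail shared by all three variants, copied from the question's pattern.
TAIL = (r'(\1)(?:.*)?(?:<.*>)?(?:http(s)?:\/\/)?(?:w{3})?'
        r'plainview.io\/archives\/(\w+)(?:.*)?(?:<.*>)?')

PATTERNS = {
    # Pattern from the question: greedy prefix, greedy capture.
    'question': r'(?:<.*>)?(?:.*)?("|quot;)(.*)' + TAIL,
    # Pattern from the answer: lazy prefix, capture still greedy.
    'answer': r'(?:<.*>)?(?:.*?)?("|quot;)(.*)' + TAIL,
    # Assumed variant (not in the answer): lazy prefix and lazy capture.
    'both_lazy': r'(?:<.*>)?(?:.*?)?("|quot;)(.*?)' + TAIL,
}

def quoted_part(pattern, text):
    """Return capture group 2 (the text between the quotes), or None."""
    m = re.search(pattern, text)
    return m.group(2) if m else None

results = {name: quoted_part(p, SAMPLE) for name, p in PATTERNS.items()}
for name, value in results.items():
    print(f'{name}: {value!r}')
```

On this sample the greedy prefix should land on the last quoted argument, the lazy prefix alone should capture everything from the first quot; to the last one, and only the variant with both parts lazy should return the sentence itself. The capture ends in a stray & because the pattern's delimiter is quot; rather than &quot;.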

A good resource I just found is "Why Using the Greedy .* in Regular Expressions Is Almost Never What You Actually Want".

A simpler way to get the same result is this regex:

&quot;(.*?)&quot;
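As a rough illustration of why the lazy quantifier matters in this shorter pattern, the Python sketch below compares it with its greedy counterpart. The `SAMPLE` text and the helper `first_quoted` are hypothetical, made up for this example rather than taken from the question.

```python
import re

# Hypothetical shortened input: a quoted sentence followed by further
# &quot; entities inside the attributes of a link.
SAMPLE = (
    '<p>Some text before, &quot;He is the best man for the job.&quot; '
    '<a href="#" onclick="log(this, &quot;/archives/SysteBvsl&quot;);">'
    'www.plainview.io/archives/SysteBvsl</a> and some text after</p>'
)

def first_quoted(text, lazy=True):
    """Return the text captured between &quot; delimiters, or None."""
    # .*? stops at the first closing &quot;, .* runs on to the last one.
    pattern = r'&quot;(.*?)&quot;' if lazy else r'&quot;(.*)&quot;'
    m = re.search(pattern, text)
    return m.group(1) if m else None

lazy_result = first_quoted(SAMPLE, lazy=True)
greedy_result = first_quoted(SAMPLE, lazy=False)

print(lazy_result)    # only the quoted sentence
print(greedy_result)  # runs on into the link's onclick attribute
```

The lazy version stops at the first closing &quot;, while the greedy version keeps going until the last &quot; in the line, which is the same behaviour that caused the problem in the question.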