Given this regular expression:
(?:<.*>)?(?:.*)?("|quot;)(.*)(\1)(?:.*)?(?:<.*>)?(?:http(s)?:\/\/)?(?:w{3})?plainview.io\/archives\/(\w+)(?:.*)?(?:<.*>)?
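I need to be able to select
"minister, a loyal party member with black rimmed glasses told us. He's the best man for the job." and I can do that from the following text: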
<p>This is some text before, "minister, a loyal party member with black rimmed glasses told us. He's the best man for the job." www.plainview.io/archives/SysteBvsl</a> and some text after</p>
But not from the following:
<p>This is some text before, "minister, a loyal party member with black rimmed glasses told us. He's the best man for the job." <a href="https://l.facebook.com/l.php?u=http%3A%2F%2Fwww.plainview.io%2Farchives%2FSysteBvsl&h=ATPBq9DrC_xIokWhmxk7f3nyKGofYnM9zGt3mF-7bfMNNupsX0WSR4TdE6VmX6W9gd_1Rnby1nXfIfq3MzgOS2PKryxKu9z3yci0ZvomiLHvYbVSfuwg29Y1Z_R1LEKRDXO3sAOZ2dsMgQ&enc=AZMnRgfaZaV-J1wtvqulToF-RxOlkhgY6kzmkLuXSv26a0waxI3nHsI1rXkl-ILjrXkcnwajsVFizefc27K5A_WlqpJrNQLKWSTnDSwIGHGHYvWDp1CWeBP8vbzcQZcnJHA-ka3LvpJIYIO7_YwPaEpKsT0I0nNewd0aHZYbPtHghob7_7a_fubIkIy5g3R7ExA&s=1" target="_blank" rel="nofollow" onmouseover="LinkshimAsyncLink.swap(this, "http:\\/\\/www.plainview.io\\/archives\\/SysteBvsl");" onclick="LinkshimAsyncLink.referrer_log(this, "http:\\/\\/www.plainview.io\\/archives\\/SysteBvsl", "\\/si\\/ajax\\/l\\/render_linkshim_log\\/?u=http\\u00253A\\u00252F\\u00252Fwww.plainview.io\\u00252Farchives\\u00252FSysteBvsl&h=ATPBq9DrC_xIokWhmxk7f3nyKGofYnM9zGt3mF-7bfMNNupsX0WSR4TdE6VmX6W9gd_1Rnby1nXfIfq3MzgOS2PKryxKu9z3yci0ZvomiLHvYbVSfuwg29Y1Z_R1LEKRDXO3sAOZ2dsMgQ&enc=AZMnRgfaZaV-J1wtvqulToF-RxOlkhgY6kzmkLuXSv26a0waxI3nHsI1rXkl-ILjrXkcnwajsVFizefc27K5A_WlqpJrNQLKWSTnDSwIGHGHYvWDp1CWeBP8vbzcQZcnJHA-ka3LvpJIYIO7_YwPaEpKsT0I0nNewd0aHZYbPtHghob7_7a_fubIkIy5g3R7ExA&d");">www.plainview.io/archives/SysteBvsl</a> and some text after</p>
Instead, for the latter, I get
\\/si\\/ajax\\/l\\/render_linkshim_log\\/?u=http\\u00253A\\u00252F\\u00252Fwww.plainview.io\\u00252Farchives\\u00252FSysteBvsl&h=ATPBq9DrC_xIokWhmxk7f3nyKGofYnM9zGt3mF-7bfMNNupsX0WSR4TdE6VmX6W9gd_1Rnby1nXfIfq3MzgOS2PKryxKu9z3yci0ZvomiLHvYbVSfuwg29Y1Z_R1LEKRDXO3sAOZ2dsMgQ&enc=AZMnRgfaZaV-J1wtvqulToF-RxOlkhgY6kzmkLuXSv26a0waxI3nHsI1rXkl-ILjrXkcnwajsVFizefc27K5A_WlqpJrNQLKWSTnDSwIGHGHYvWDp1CWeBP8vbzcQZcnJHA-ka3LvpJIYIO7_YwPaEpKsT0I0nNewd0aHZYbPtHghob7_7a_fubIkIy5g3R7ExA&d&
Why does it select text further along when I add more text (text that actually comes after the string I need)?
Answer 0: (score: 1)
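You should understand how regular expressions work internally.
The problem here is mainly an (overly) complicated regex combined with greediness: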
(?:<.*>)?(?:.*?)?("|quot;)(.*)(\1)(?:.*)?(?:<.*>)?(?:http(s)?:\/\/)?(?:w{3})?plainview.io\/archives\/(\w+)(?:.*)?(?:<.*>)?
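will solve your problem. All I did here was replace the first (?:.*) with (?:.*?) (adding the ?), which makes that part lazy instead of greedy.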
A good resource I just found is "Why Using the Greedy .* in Regular Expressions Is Almost Never What You Actually Want".
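To see the greedy behaviour in isolation, here is a minimal sketch in Python (an assumption on my part, since the question does not say which regex engine is used). The sample string is a hypothetical, shortened stand-in for the HTML in the question, and the two patterns are cut-down versions of the original regex. Note that the lazy variant here also makes the capture group lazy, which goes one step further than the single edit described above.

```python
import re

# Hypothetical sample, shortened from the HTML in the question.
text = 'before, "the quote I want" <a href="http://example.com" target="_blank">link</a>'

# Greedy prefix and greedy group: the leading .* swallows as much as it can,
# so the group ends up between the last two quotes in the string.
greedy = re.search(r'.*"(.*)"', text).group(1)

# Lazy prefix and lazy group: matching stops at the first pair of quotes.
lazy = re.search(r'.*?"(.*?)"', text).group(1)

print(greedy)  # _blank
print(lazy)    # the quote I want
```

The greedy version backtracks from the end of the string, so every extra quote added after the text you want pulls the capture further to the right. That is the same effect as in the question, where the match jumped into the link's attributes.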
A simpler way to get the same result is this regex:
"(.*?)"
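As a rough check, the simpler pattern can be tried against the first sample from the question. This sketch again assumes Python; `first_quoted` is a hypothetical helper name, and `linked` is a hypothetical, heavily shortened stand-in for the second sample (the long Facebook link attributes are left out).

```python
import re
from typing import Optional

QUOTED = re.compile(r'"(.*?)"')

def first_quoted(text: str) -> Optional[str]:
    """Return the text inside the first pair of double quotes, or None."""
    match = QUOTED.search(text)
    return match.group(1) if match else None

# First sample from the question.
plain = (
    '<p>This is some text before, "minister, a loyal party member with black '
    "rimmed glasses told us. He's the best man for the job.\" "
    'www.plainview.io/archives/SysteBvsl</a> and some text after</p>'
)

# Hypothetical shortened version of the second sample (attributes trimmed).
linked = (
    '<p>This is some text before, "minister, a loyal party member with black '
    "rimmed glasses told us. He's the best man for the job.\" "
    '<a href="https://l.facebook.com/l.php" target="_blank">'
    'www.plainview.io/archives/SysteBvsl</a> and some text after</p>'
)

print(first_quoted(plain))
print(first_quoted(linked))
```

Both calls return the quoted sentence, because the search takes the leftmost pair of quotes and the lazy group stops at the first closing quote. Keep in mind that this pattern no longer checks that a plainview.io/archives link follows the quote, so it only fits if the first quoted span in the text is always the one you want.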