Get links that are not from "example.com"

Time: 2019-11-04 20:47:54

Tags: regex

I have the following text:

&#32; submitted by &#32; <a href="https://www.reddit.com/user/Leon91"> /u/Leon91 </a> <br/> <span><a href="https://www.dailymail.co.uk/news/article-7646171/Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/worldnews/comments/drfnas/jared_kushner_greenlit_arrest_of_jamal_khashoggi/">[comments]</a></span>

I want to get all links that are not from reddit.com. For the text above, the link https://www.dailymail.co.uk/news/article-7646171/Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html should be the only result.

I tried the following, which matches all URLs:

(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})

However, I want only the URLs that are not from reddit.com.

Any suggestions on how to solve this?

Thanks for your replies!

1 answer:

Answer 0: (score: 3)

Using a regex, getting the href links of all 'a' tags that do not contain reddit.com can be done like this:

The link is captured in group 2.

<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])((?:(?!\1|reddit\.com)[\S\s])+)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

https://regex101.com/r/UxKB0a/1
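As a minimal sketch of how the pattern above could be applied, here it is in Python's `re` module; the answer does not name a language, so Python is an assumption. The names `A_TAG_NOT_REDDIT`, `SAMPLE`, and `non_reddit_links` are hypothetical, chosen only for illustration. The pattern is copied verbatim from the answer and the sample is the text from the question.

```python
import re

# The answer's pattern, unchanged. Group 1 holds the quote character that
# opens the href value; group 2 holds the href value itself.
# The tempered token (?:(?!\1|reddit\.com)[\S\s])+ stops at the closing quote
# and refuses to step onto the text "reddit.com", so an href containing it
# can never reach its closing quote and the whole tag fails to match.
A_TAG_NOT_REDDIT = re.compile(
    r"""<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])((?:(?!\1|reddit\.com)[\S\s])+)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>"""
)

# The sample text from the question, split across lines only for readability.
SAMPLE = (
    '&#32; submitted by &#32; '
    '<a href="https://www.reddit.com/user/Leon91"> /u/Leon91 </a> <br/> '
    '<span><a href="https://www.dailymail.co.uk/news/article-7646171/'
    'Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html">'
    '[link]</a></span> &#32; '
    '<span><a href="https://www.reddit.com/r/worldnews/comments/drfnas/'
    'jared_kushner_greenlit_arrest_of_jamal_khashoggi/">[comments]</a></span>'
)


def non_reddit_links(html):
    """Return the href value (group 2) of every <a> tag the pattern matches."""
    return [match.group(2) for match in A_TAG_NOT_REDDIT.finditer(html)]


if __name__ == "__main__":
    for link in non_reddit_links(SAMPLE):
        print(link)
```

On the sample text this is expected to return only the dailymail.co.uk URL. Note that the pattern rejects any href that contains the text reddit.com anywhere, so a URL on another host with reddit.com in its path or query string would also be skipped.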