Question

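I'm writing a script that pulls the URLs out of my blog posts and runs curl -I on them, so I can check whether they are still alive. But I'm having trouble writing the grep pattern.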

<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>

So from this I want only http://example.com/fujipol/2004/may/5/16:10:47/400x345.

Or, in Markdown form:

[Example markdown link](https://example.com)

Here I want https://example.com

<http://example.com/?foo=bar>

In this case, I need http://example.com/?foo=bar

Answer 1

Create a file with the links from your examples:

$> cat ./text
<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>
[Example markdown link](https://example.com)
<http://example.com/?foo=bar>
<a href="http://people.debian.org/~dilinger/backports/wordpress">http://people.debian.org/~dilinger/backports/wordpress</a>

Grep it with a regular expression to pull out all the URLs:

$> grep --only-matching --perl-regexp "http(s?):\/\/[^ \"\(\)\<\>]*" ./text
http://example.com/fujipol/2004/may/5/16:10:47/400x345
https://example.com
http://example.com/?foo=bar
http://people.debian.org/~dilinger/backports/wordpress
http://people.debian.org/~dilinger/backports/wordpress

Done.
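To connect this back to the curl -I check from the question, here is a minimal sketch. The function name check_urls and the 10-second timeout are my own assumptions, not part of the original answer. It reads URLs on standard input, drops duplicates (the output above lists the Debian URL twice), and prints the HTTP status code next to each URL.

```shell
# Hypothetical helper: read URLs on stdin, print "STATUS URL" for each distinct one.
check_urls() {
    sort -u | while IFS= read -r url; do
        # -s hides the progress meter, -I sends a HEAD request,
        # -o /dev/null discards the response headers,
        # -w '%{http_code}' prints only the status code (000 if the request failed).
        status=$(curl -s -I -o /dev/null -w '%{http_code}' --max-time 10 "$url")
        printf '%s %s\n' "$status" "$url"
    done
}

# Usage, with the same grep command as above:
#   grep --only-matching --perl-regexp "http(s?):\/\/[^ \"\(\)\<\>]*" ./text | check_urls
```

Some servers answer HEAD requests differently from GET requests, so an unexpected status code is probably worth a second look by hand before treating the link as dead.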

http(s?):\/\/[^ \"\(\)\<\>]*

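Here the pattern first matches http(s) (a URL can start with http:// or https://), then matches ://, with the slashes escaped. Finally, it matches a run of characters that are not a space, ", (, ), <, or >.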
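The same idea can be spelled with fewer backslashes. The following is a sketch only, assuming a grep that supports -o and -E; the function name extract_urls is hypothetical. Inside single quotes and a bracket expression, none of the delimiter characters need escaping, which makes the three parts of the pattern easier to see.

```shell
# Hypothetical helper: print every URL found on stdin, one per line.
extract_urls() {
    # https?      "http" followed by an optional "s"
    # ://         the literal scheme separator
    # [^ "()<>]*  any run of characters other than space, ", (, ), <, >
    grep -oE 'https?://[^ "()<>]*'
}

# Sample lines taken from the question.
printf '%s\n' \
    '<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>' \
    '[Example markdown link](https://example.com)' \
    '<http://example.com/?foo=bar>' | extract_urls
```

Run on these three lines, it should print the same three URLs as the Perl-regexp version above.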

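In the end, the whole problem in tasks like this comes down to deciding where the part we need begins (in this case http(s)://) and where it ends (a space, ", (, ), <, or >).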

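Frankly, this solution is far from perfect. The URL standards say much more about which characters a URL may or may not contain, so, as you may already know, the regex in my answer is not strictly valid. But it works for the cases you described.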
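Two concrete cases where the pattern falls short, as a small sketch. The sample sentences are made up for illustration, and the pattern is written in ERE form, which I assume behaves the same as the Perl-regexp spelling on these inputs.

```shell
# ERE spelling of the answer's pattern (assumed equivalent for these inputs).
pattern='https?://[^ "()<>]*'

# A URL that legitimately contains parentheses is cut off at the first "(".
printf '%s\n' 'see http://example.com/wiki/Foo_(bar) for details' | grep -oE "$pattern"
# prints: http://example.com/wiki/Foo_

# Trailing sentence punctuation is swallowed into the match.
printf '%s\n' 'Read http://example.com/page, then continue.' | grep -oE "$pattern"
# prints: http://example.com/page,
```

If your posts contain links like these, the character class would need adjusting, or the extracted URLs would need a clean-up pass before being handed to curl.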

How do I grep URLs from a blog?
