I want to extract all anchor tags from an HTML page. I am using this on Linux:
lynx --source http://www.imdb.com | egrep "<a[^>]*>"
But it does not work as expected, because the output contains things I don't want, such as:
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
What I want is just
<a href >...</a>
Is there a good way to do this?
Answer 0 (score: 5)
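If your grep has the -P option, so that it accepts PCRE patterns, you should be able to use a better regex. Sometimes a minimal quantifier like *? helps. Also, you are getting the whole input line, not just the match itself; if your grep has the -o option, it will print only the part that matched.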
grep -Po '<a[^<>]*>'
If your grep lacks those options, try
perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'
which now matches across line boundaries, because -00 makes perl read the input a paragraph at a time.
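For readers without a PCRE-capable grep or perl at hand, the same idea can be sketched in Python. This is only an illustration, not part of the original answer: re.findall returns just the matched text (like -o), and because the whole document is searched as one string, a tag that wraps onto the next line is still found. The sample HTML string and the variable names are made up.

```python
import re

# Hypothetical sample input; the second tag is split across two lines.
html = '''<p><a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
<A
   href="http://www.imdb.com/chart/">Charts</A></p>'''

# Same pattern as the grep/perl one-liners; re.IGNORECASE plays the role of /i.
anchor_open = re.compile(r'<a[^<>]*>', re.IGNORECASE)

# findall returns only the matched substrings, not the surrounding lines.
tags = anchor_open.findall(html)
for tag in tags:
    print(tag)
```

Note that this still finds only the opening tag, and it would also match any other element whose name starts with "a", so it is a quick filter rather than a parser.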
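Doing a real parse of HTML requires regexes far more complex than anything you would want to type on the command line. Here’s one example and here’s another. These may not convince you to try a non-regex approach, but they should at least show you how much harder it is in the general case than in specific ones.
This answer shows why all things are possible, but not all are expedient.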
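To make the "harder in the general case" point concrete, here is a small sketch of one input that defeats the simple pattern. The markup is hypothetical, but it is legal HTML: a quoted attribute value is allowed to contain '>'.

```python
import re

# Hypothetical markup: the title attribute value legitimately contains '>'.
html = '<a title="next >" href="/page/2">Next</a>'

match = re.search(r'<a[^<>]*>', html, re.IGNORECASE)
truncated = match.group(0)

# The match stops at the '>' inside the quoted value, so the href is lost.
print(truncated)
```

Handling this correctly means teaching the regex about quoting, which is where the patterns start to grow beyond what is comfortable on a command line.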
Answer 1 (score: 2)
Why can't you use options such as --dump?
lynx --dump --listonly http://www.imdb.com
Answer 2 (score: 0)
Try grep -Eo:
$ echo '<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>' | grep -Eo '<a[^>]*>'
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">
But do read the answers that MAK linked to.
Answer 3 (score: 0)
Here are some examples of why you should not use regex to parse HTML.
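To extract the 'href' attribute values of the anchor tags, run:
$ python -c'import sys, lxml.html as h
> root = h.parse(sys.argv[1]).getroot()
> root.make_links_absolute(base_url=sys.argv[1])
> print("\n".join(root.xpath("//a/@href")))' http://imdb.com | sort -u
Install the lxml module if needed: $ sudo apt-get install python-lxml (the package is named python3-lxml for Python 3).
The output looks like this: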
http://askville.amazon.com http://idfilm.blogspot.com/2011/02/another-class.html http://imdb.com http://imdb.com/ http://imdb.com/a2z http://imdb.com/a2z/ http://imdb.com/advertising/ http://imdb.com/boards/ http://imdb.com/chart/ http://imdb.com/chart/top http://imdb.com/czone/ http://imdb.com/features/hdgallery http://imdb.com/features/oscars/2011/ http://imdb.com/features/sundance/2011/ http://imdb.com/features/video/ http://imdb.com/features/video/browse/ http://imdb.com/features/video/trailers/ http://imdb.com/features/video/tv/ http://imdb.com/features/yearinreview/2010/ http://imdb.com/genre http://imdb.com/help/ http://imdb.com/helpdesk/contact http://imdb.com/help/show_article?conditions http://imdb.com/help/show_article?rssavailable http://imdb.com/jobs http://imdb.com/lists http://imdb.com/media/index/rg2392693248 http://imdb.com/media/rm3467688448/rg2392693248 http://imdb.com/media/rm3484465664/rg2392693248 http://imdb.com/media/rm3719346688/rg2392693248 http://imdb.com/mymovies/list http://imdb.com/name/nm0000207/ http://imdb.com/name/nm0000234/ http://imdb.com/name/nm0000631/ http://imdb.com/name/nm0000982/ http://imdb.com/name/nm0001392/ http://imdb.com/name/nm0004716/ http://imdb.com/name/nm0531546/ http://imdb.com/name/nm0626362/ http://imdb.com/name/nm0742146/ http://imdb.com/name/nm0817980/ http://imdb.com/name/nm2059117/ http://imdb.com/news/ http://imdb.com/news/celebrity http://imdb.com/news/movie http://imdb.com/news/ni7650335/ http://imdb.com/news/ni7653135/ http://imdb.com/news/ni7654375/ http://imdb.com/news/ni7654598/ http://imdb.com/news/ni7654810/ http://imdb.com/news/ni7655320/ http://imdb.com/news/ni7656816/ http://imdb.com/news/ni7660987/ http://imdb.com/news/ni7662397/ http://imdb.com/news/ni7665028/ http://imdb.com/news/ni7668639/ http://imdb.com/news/ni7669396/ http://imdb.com/news/ni7676733/ http://imdb.com/news/ni7677253/ http://imdb.com/news/ni7677366/ http://imdb.com/news/ni7677639/ http://imdb.com/news/ni7677944/ 
http://imdb.com/news/ni7678014/ http://imdb.com/news/ni7678103/ http://imdb.com/news/ni7678225/ http://imdb.com/news/ns0000003/ http://imdb.com/news/ns0000018/ http://imdb.com/news/ns0000023/ http://imdb.com/news/ns0000031/ http://imdb.com/news/ns0000128/ http://imdb.com/news/ns0000136/ http://imdb.com/news/ns0000141/ http://imdb.com/news/ns0000195/ http://imdb.com/news/ns0000236/ http://imdb.com/news/ns0000344/ http://imdb.com/news/ns0000345/ http://imdb.com/news/ns0004913/ http://imdb.com/news/top http://imdb.com/news/tv http://imdb.com/nowplaying/ http://imdb.com/photo_galleries/new_photos/2010/ http://imdb.com/poll http://imdb.com/privacy http://imdb.com/register/login http://imdb.com/register/?why=footer http://imdb.com/register/?why=mymovies_footer http://imdb.com/register/?why=personalize http://imdb.com/rg/NAV_TWITTER/NAV_EXTRA/http://www.twitter.com/imdb http://imdb.com/ri/TRAILERS_HPPIRATESVID/TOP_BUCKET/102785/video/imdb/vi161323033/ http://imdb.com/search http://imdb.com/search/ http://imdb.com/search/name?birth_monthday=02-12 http://imdb.com/search/title?sort=num_votes,desc&title_type=feature&my_ratings=exclude http://imdb.com/sections/dvd/ http://imdb.com/sections/horror/ http://imdb.com/sections/indie/ http://imdb.com/sections/tv/ http://imdb.com/showtimes/ http://imdb.com/tiger_redirect?FT_LIC&licensing/ http://imdb.com/title/tt0078748/ http://imdb.com/title/tt0279600/ http://imdb.com/title/tt0377981/ http://imdb.com/title/tt0881320/ http://imdb.com/title/tt0990407/ http://imdb.com/title/tt1034389/ http://imdb.com/title/tt1265990/ http://imdb.com/title/tt1401152/ http://imdb.com/title/tt1411238/ http://imdb.com/title/tt1411238/trivia http://imdb.com/title/tt1446714/ http://imdb.com/title/tt1452628/ http://imdb.com/title/tt1464174/ http://imdb.com/title/tt1464540/ http://imdb.com/title/tt1477837/ http://imdb.com/title/tt1502404/ http://imdb.com/title/tt1504320/ http://imdb.com/title/tt1563069/ http://imdb.com/title/tt1564367/ 
http://imdb.com/title/tt1702443/ http://imdb.com/tvgrid/ http://m.imdb.com http://pro.imdb.com/r/IMDbTabNB/ http://resume.imdb.com http://resume.imdb.com/ https://secure.imdb.com/register/subscribe?c=a394d4442664f6f6475627 http://twitter.com/imdb http://wireless.amazon.com http://www.3news.co.nz/The-Hobbit-media-conference--full-video/tabid/312/articleID/198020/Default.aspx http://www.amazon.com/exec/obidos/redirect-home/internetmoviedat http://www.audible.com http://www.boxofficemojo.com http://www.dpreview.com http://www.endless.com http://www.fabric.com http://www.imdb.com/board/bd0000089/threads/ http://www.imdb.com/licensing/ http://www.imdb.com/media/rm1037220352/rg261921280 http://www.imdb.com/media/rm2695346688/tt1449283 http://www.imdb.com/media/rm3987585536/tt1092026 http://www.imdb.com/name/nm0000092/ http://www.imdb.com/photo_galleries/new_photos/2010/index http://www.imdb.com/search/title?sort=num_votes,desc&title_type=tv_series&my_ratings=exclude http://www.imdb.com/sections/indie/ http://www.imdb.com/title/tt0079470/ http://www.imdb.com/title/tt0079470/quotes?qt0471997 http://www.imdb.com/title/tt1542852/ http://www.imdb.com/title/tt1606392/ http://www.imdb.de http://www.imdb.es http://www.imdb.fr http://www.imdb.it http://www.imdb.pt http://www.movieline.com/2011/02/watch-jon-hamm-talk-butthole-surfers-paul-rudd-impersonate-jay-leno-at-book-reading-1.php http://www.movingimagesource.us/articles/un-tv-20110210 http://www.npr.org/blogs/monkeysee/2011/02/10/133629395/james-franco-recites-byron-to-the-worlds-luckiest-middle-school-journalist http://www.nytimes.com/2011/02/06/books/review/Brubach-t.html http://www.shopbop.com/welcome http://www.smallparts.com http://www.twinpeaks20.com/details/ http://www.twitter.com/imdb http://www.vanityfair.com/hollywood/features/2011/03/lauren-bacall-201103 http://www.warehousedeals.com http://www.withoutabox.com http://www.zappos.com
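If installing lxml is not an option, roughly the same result can be had with the Python 3 standard library alone. The following is a minimal sketch under that assumption, not the method used above: HrefCollector and extract_hrefs are names invented here, and the sample markup is made up. It collects the href of every anchor start tag, makes it absolute against a base URL, and de-duplicates and sorts the result (the job of sort -u above).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect absolute href values from <a> start tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.add(urljoin(self.base_url, value))


def extract_hrefs(html, base_url):
    """Return the sorted, de-duplicated absolute links found in html."""
    parser = HrefCollector(base_url)
    parser.feed(html)
    parser.close()
    return sorted(parser.hrefs)


# Made-up sample: mixed case, a '>' inside a quoted value, an anchor without href.
sample = '''<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
<A title="next >" HREF="/chart/top">Top</A> <a name="no-link">x</a>'''
links = extract_hrefs(sample, "http://imdb.com")
print("\n".join(links))
```

To run it against a live page you would fetch the HTML first (for example with urllib.request.urlopen) and pass the decoded text to extract_hrefs.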
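Answer 4 (score: 0)
To extract the 'href' attribute values of the anchor tags, you can also use xmlstarlet after converting the HTML to XHTML with HTML Tidy (the Mac OS X version released on 25 March 2009):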
curl -s www.imdb.com |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a/@href" -v '.' -n |
grep '^[[:space:]]*http://' | sort -u | nl
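The last two stages of that pipeline (a namespace-aware XPath query, then filtering and numbering) can be sketched in Python with the standard library. This is an illustration only: it assumes the input is already well-formed XHTML, as tidy -asxml would produce, and the sample document and variable names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied XHTML; real input would come from tidy -asxml.
xhtml = '''<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="http://www.imdb.com/chart/">Charts</a>
       <a href="/help/">Help</a>
       <a href="http://www.fabric.com">Fabric</a>
       <a href="http://www.imdb.com/chart/">Charts again</a></p>
  </body>
</html>'''

# Bind the prefix "x" to the XHTML namespace, as -N x=... does for xmlstarlet.
ns = {"x": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(xhtml)

# Rough counterpart of the XPath //x:a/@href.
hrefs = [a.get("href") for a in root.iterfind(".//x:a[@href]", ns)]

# Counterpart of grep '^[[:space:]]*http://' | sort -u
absolute = sorted({h for h in hrefs if h.lstrip().startswith("http://")})

# Counterpart of nl: number the remaining lines.
for number, href in enumerate(absolute, start=1):
    print(number, href)
```

Without the namespace binding the query finds nothing, because every element in an XHTML document lives in the http://www.w3.org/1999/xhtml namespace; that is the reason for the -N option in the xmlstarlet command.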
Answer 5 (score: 0)
On Mac OS X, you can also use the command-line tool linkscraper:
linkscraper http://www.imdb.com