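I'm using cURL to fetch a website's content. I then want to go through that content and collect all the hyperlinks (anchor tags only). I've had no luck with regular expressions and was pointed toward loadXML(content) and loadHTML(content). For starters, I'm not sure whether the site I'm checking is XHTML (loadXML) or plain HTML, in which case I would use loadHTML. With either one I get errors such as "Opening and ending tag mismatch". I tried strictErrorChecking = FALSE and libxml_use_internal_errors(true) as others have suggested, and I also tried tidy_parse_string, but that returned nothing. Has anyone had a similar problem, or does anyone know another way, perhaps a simple regular expression that grabs all the hyperlinks?
Thanks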
Answer 0 (score: 1)
That's the long way around... why not use file_get_html from Simple HTML DOM?
http://simplehtmldom.sourceforge.net
Example
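include 'simple_html_dom.php';

$url = "http://php.net/";
$html = file_get_html($url);   // fetch and parse the page in one call

echo "<pre>";
foreach ($html->find('a') as $element) {   // every anchor tag on the page
    $link = $element->href;
    $link = ltrim($link, "/");
    // Prefix the base URL unless the link already starts with http:// or https://
    if (!preg_match("#^https?://#i", $link)) {
        $link = $url . $link;
    }
    echo $link . PHP_EOL;
    flush();
}
echo "</pre>";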
Output
http://php.net/
http://php.net/downloads.php
http://php.net/docs.php
http://php.net/FAQ.php
http://php.net/support.php
http://php.net/mailing-lists.php
http://php.net/license
https://wiki.php.net/
https://bugs.php.net/
http://php.net/sites.php
http://php.net/links.php
http://php.net/conferences/
http://php.net/my.php
http://php.net/tut.php
http://php.net/docs.php
http://php.net/links.php
http://php.net/usage.php
http://php.net/thanks.php
http://www.easydns.com/?V=698570efeb62a6e2
http://www.directi.com/
http://promote.pair.com/direct.pl?php.net
http://www.servercentral.net/
http://www.hostedsolutions.com/
http://www.spry.com/
http://www.osuosl.org
http://www.yahoo.com/
http://www.nexcess.net/
http://www.rackspace.com/
http://www.eukhost.com/
http://www.sohosted.nl/webhosting/
http://www.redpill-linpro.com
http://www.facebook.com
http://krystal.co.uk
http://servergrove.com/
http://www.bauer-kirch.de/
http://www.apache.org/
http://www.mysql.com/
http://www.postgresql.org/
http://www.zend.com/
http://www.linuxfund.org/
http://ostg.com/
http://php.net/feed.atom
http://php.net/downloads.php#v5
http://php.net/downloads.php#v5
http://php.net/submit-event.php
http://php.net/cal.php?id=2662
http://php.net/cal.php?id=3422
http://php.net/cal.php?id=4019
http://php.net/cal.php?id=1099
http://php.net/cal.php?id=4767
http://php.net/cal.php?id=1745
http://php.net/cal.php?id=1860
http://php.net/cal.php?id=2301
http://php.net/cal.php?id=2814
http://php.net/cal.php?id=3294
http://php.net/cal.php?id=2352
http://php.net/cal.php?id=2682
http://php.net/cal.php?id=3793
http://php.net/cal.php?id=109
http://php.net/cal.php?id=272
http://php.net/cal.php?id=561
http://php.net/cal.php?id=1005
http://php.net/cal.php?id=1304
http://php.net/cal.php?id=1624
http://php.net/cal.php?id=1632
http://php.net/cal.php?id=1706
http://php.net/cal.php?id=1918
http://php.net/cal.php?id=2017
http://php.net/cal.php?id=2418
http://php.net/cal.php?id=2734
http://php.net/cal.php?id=2932
http://php.net/cal.php?id=3416
http://php.net/cal.php?id=3861
http://php.net/cal.php?id=4014
http://php.net/cal.php?id=4147
http://php.net/cal.php?id=4799
http://php.net/cal.php?id=153
http://php.net/cal.php?id=2663
http://php.net/cal.php?id=1923
http://php.net/cal.php?id=2540
http://php.net/cal.php?id=4720
http://php.net/cal.php?id=1385
http://php.net/cal.php?id=1523
http://php.net/cal.php?id=1670
http://php.net/cal.php?id=1652
http://php.net/cal.php?id=1665
http://php.net/cal.php?id=1847
http://php.net/cal.php?id=3643
http://php.net/cal.php?id=3980
http://php.net/cal.php?id=4222
http://php.net/cal.php?id=4511
http://php.net/cal.php?id=1395
http://php.net/cal.php?id=3684
http://php.net/cal.php?id=4512
http://php.net/cal.php?id=4751
http://php.net/cal.php?id=5017
http://php.net/cal.php?id=5212
http://php.net/cal.php?id=1848
http://php.net/cal.php?id=1946
http://php.net/cal.php?id=4636
http://php.net/cal.php?id=1732
http://php.net/cal.php?id=2580
http://php.net/cal.php?id=3722
http://php.net/cal.php?id=4258
http://php.net/cal.php?id=3760
http://php.net/cal.php?id=4308
http://php.net/cal.php?id=2246
http://php.net/cal.php?id=3708
http://php.net/cal.php?id=3761
http://php.net/cal.php?id=4725
http://php.net/cal.php?id=5222
http://php.net/cal.php?id=1545
http://php.net/cal.php?id=1546
http://php.net/cal.php?id=2208
http://php.net/cal.php?id=3925
http://php.net/cal.php?id=1704
http://php.net/cal.php?id=1719
http://php.net/cal.php?id=1820
http://php.net/cal.php?id=4507
http://php.net/cal.php?id=5092
http://php.net/cal.php?id=1131
http://php.net/cal.php?id=1346
http://php.net/cal.php?id=1671
http://php.net/cal.php?id=2449
http://php.net/cal.php?id=409
http://php.net/cal.php?id=384
http://php.net/cal.php?id=3075
http://php.net/cal.php?id=3653
http://php.net/cal.php?id=5135
http://php.net/cal.php?id=4256
http://php.net/cal.php?id=5052
http://php.net/cal.php?id=2662
http://php.net/cal.php?id=3422
http://php.net/cal.php?id=4019
http://php.net/cal.php?id=1099
http://php.net/cal.php?id=4648
http://php.net/cal.php?id=4767
http://php.net/cal.php?id=2527
http://php.net/cal.php?id=2600
http://php.net/cal.php?id=2660
http://php.net/cal.php?id=4626
http://php.net/cal.php?id=5276
http://php.net/cal.php?id=2500
http://php.net/cal.php?id=4922
http://php.net/cal.php?id=1316
http://php.net/cal.php?id=1708
http://php.net/cal.php?id=2499
http://php.net/cal.php?id=841
http://php.net/cal.php?id=1490
http://php.net/cal.php?id=5187
http://php.net/cal.php?id=2144
http://php.net/cal.php?id=3703
http://php.net/cal.php?id=5289
http://php.net/cal.php?id=5305
http://php.net/cal.php?id=1516
http://php.net/cal.php?id=2702
http://php.net/cal.php?id=3560
http://php.net/cal.php?id=2023
http://php.net/cal.php?id=4230
http://php.net/cal.php?id=338
http://php.net/cal.php?id=456
http://php.net/cal.php?id=641
http://php.net/cal.php?id=998
http://php.net/cal.php?id=1198
http://php.net/cal.php?id=1360
http://php.net/cal.php?id=1981
http://php.net/cal.php?id=2051
http://php.net/cal.php?id=3053
http://php.net/cal.php?id=5193
http://php.net/cal.php?id=5190
http://php.net/cal.php?id=5191
http://php.net/cal.php?id=5188
http://php.net/cal.php?id=5186
http://php.net/cal.php?id=5185
http://php.net/cal.php?id=5184
http://php.net/cal.php?id=5298
http://php.net/cal.php?id=5308
http://php.net/cal.php?id=3385
http://php.net/cal.php?id=3386
http://php.net/cal.php?id=1466
http://php.net/cal.php?id=1583
http://php.net/cal.php?id=5125
http://php.net/cal.php?id=5228
http://php.net/cal.php?id=5194
http://php.net/cal.php?id=5251
http://php.net/cal.php?id=1389
http://php.net/cal.php?id=2408
http://php.net/cal.php?id=1200
http://php.net/cal.php?id=2589
http://php.net/cal.php?id=5247
http://php.net/cal.php?id=5279
http://php.net/cal.php?id=5303
http://php.net/cal.php?id=231
http://php.net/cal.php?id=5192
http://php.net/cal.php?id=5309
http://php.net/cal.php?id=1137
http://php.net/cal.php?id=4220
http://www.php.net/conferences/index.php#id2012-01-20-1
http://www.php.net/conferences/index.php#id2011-12-23-1
http://www.php.net/conferences/index.php#id2012-02-09-1
http://www.php.net/archive/2012.php#id2012-04-26-1
http://php.net/ChangeLog-5.php
http://php.net/downloads.php
http://windows.php.net/download/
http://www.php.net/archive/2012.php#id2012-04-13-1
http://qa.php.net
http://windows.php.net/qa/
http://git.php.net/?p=php-src.git;a=blob;f=NEWS;h=d647f8de7cf080b599a73e092d683273fbf744e8;hb=fa1437b144683eae4d253473c35e375f7b743811
http://php.net/mailto:php-qa@lists.php.net
https://bugs.php.net/
http://www.php.net/archive/2012.php#id2012-03-20-1
https://github.com/php/php-src
http://git.php.net/
http://php.net/git
https://wiki.php.net/vcs/gitfaq
http://www.php.net/archive/2012.php#id2012-03-01-1
http://php.net/downloads.php#v5.4.0
http://php.net/traits
http://docs.php.net/manual/en/language.types.array.php
http://php.net/manual/en/features.commandline.webserver.php
http://php.net/migration54
http://php.net/downloads.php#v5.4.0
http://php.net/downloads.php#v5.4.0
http://php.net/releases/5_4_0.php
http://php.net/ChangeLog-5.php
http://www.php.net/archive/2012.php#id2012-02-02-1
http://php.net/downloads.php
http://windows.php.net/download/
http://php.net/archive/index.php
http://php.net/feed.atom
http://php.net/source.php?url=/index.php
http://php.net/credits.php
http://php.net/stats/
http://php.net/sitemap.php
http://php.net/contact.php
http://php.net/contact.php#ads
http://php.net/mirrors.php
http://php.net/copyright.php
http://php.net/mirror.php
http://developer.yahoo.com/
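If you would rather stay with the built-in DOM extension mentioned in the question, something along these lines may work. It is only a minimal sketch: extractLinks() is a made-up helper name, and the sample markup is invented to mimic the mismatched-tag errors. The idea is to use loadHTML (which tolerates broken markup, unlike loadXML) and to switch on libxml_use_internal_errors before parsing so the warnings are collected instead of printed. In practice $html would be the string returned by curl_exec with CURLOPT_RETURNTRANSFER enabled.

```php
<?php
// Sketch: collect href values from anchor tags using the built-in DOM extension.
// extractLinks() is a hypothetical helper name, not part of any library.
function extractLinks(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);   // HTML parser: recovers from mismatched tags

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip named anchors that carry no href attribute.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Deliberately malformed sample markup (a stray </div> closes nothing).
$sample = '<html><body><p><a href="/docs.php">Docs</a></div>'
        . '<a name="top">no href</a>'
        . '<a href="http://example.com/">Ext</a></body></html>';

print_r(extractLinks($sample));
```

The hrefs come back exactly as written in the page, so relative links such as /docs.php still need to be resolved against the base URL, as the loop above does by simple concatenation.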