Regex not working with web crawler

Time: 2014-12-08 07:00:01

Tags: php regex web web-crawler

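I have this simple web scraper that returns all the links (a tags) from a Google search results page. However, my preg_match_all call does not seem to return the relevant links that sit between the two strings I am matching on. I believe my regex is correct; I have tested it on several other platforms.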

foreach ($html->find('a') as $element) {
    // attempt to retrieve the actual link in between these strings
    preg_match_all("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches);
    echo $element->href . '<br/>'; // prints out each of the links
}
print_r($matches);

Below is the output for the page I am trying to retrieve the relevant links from, when searching for a person named John Smith:

https://www.google.com/webhp?tab=ww
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbm=isch&source=og&sa=N&tab=wi
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wl
https://play.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=w8
https://www.youtube.com/results?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
http://www.google.com/intl/en/options/
https://www.google.com/calendar?tab=wc
https://translate.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wT
http://www.google.com/mobile/?hl=en&tab=wD
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=bks&source=og&sa=N&tab=wp
https://wallet.google.com/manage/?tab=wa
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=shop&source=og&sa=N&tab=wf
https://www.blogger.com/?tab=wj
https://www.google.com/finance?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=we
https://plus.google.com/photos?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=wq
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=vid&source=og&sa=N&tab=wv
http://www.google.com/intl/en/options/
https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search%3Fq%3DJohn%2BSmith
http://www.google.com/preferences?hl=en
/preferences?hl=en
http://www.google.com/history/optout?hl=en
/webhp?hl=en
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=isch&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAUQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=vid&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAYQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=nws&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAcQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=shop&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAgQ_AU
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAkQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=bks&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAoQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:h&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:d&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:w&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:m&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:y&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=li:1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBQQFjAA&usg=AFQjCNFgBV3CPR5ydtty6z72kDKto_Ij7A
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:2n5isO4EbUAJ:http://en.wikipedia.org/wiki/John_Smith_(explorer)%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBcQIDAA&usg=AFQjCNGxUvb-aHUJmV-p4VbGXmUJE1nPBw
/search?ie=UTF-8&q=related:en.wikipedia.org/wiki/John_Smith_(explorer)+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBgQHzAA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Early_adventures&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBoQ0gIoADAA&usg=AFQjCNFK7RzMUfQA5LZYUNaL2C_K0cEbjA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23In_Jamestown&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBsQ0gIoATAA&usg=AFQjCNF0pFVxwtdohofHr3bWQXJhk1XMcA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23New_England&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBwQ0gIoAjAA&usg=AFQjCNE4VqtjkQwsNzO_haCNSUi-3bgTsw
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Death_and_burial&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB0Q0gIoAzAA&usg=AFQjCNFAr4O8yWEK93_GyyN6_srpqLaljQ
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:iuJ7Uh7IOtgJ:http://www.apva.org/history/jsmith.html%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCIQIDAB&usg=AFQjCNG_keb3HZAHUteBGMb3k5GTIeVr5w
/search?ie=UTF-8&q=related:www.apva.org/history/jsmith.html+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCMQHzAB
/images?q=John+Smith&hl=en&sa=X&oi=image_result_group&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCUQsAQ
/url?q=http://etc.usf.edu/clipart/200/269/smith_2.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCcQ9QEwAg&usg=AFQjCNF3B9TL94enKovOL1hlz-n0A4PXrA
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCkQ9QEwAw&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCsQ9QEwBA&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ
/url?q=http://www.shmoop.com/jamestown/photo-john-smith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC0Q9QEwBQ&usg=AFQjCNFvEq7Cq3P6WdNIIHpNVVuQLTMhdQ
/url?q=http://www.wpclipart.com/American_History/settlement/John_Smith/Captain_John_Smith.png.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC8Q9QEwBg&usg=AFQjCNGEWlYKoQUhODn-3jypeyaw4urAGw
/url?q=http://www.web-books.com/Classics/ON/B1/B1583/07MB1583.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDEQ9QEwBw&usg=AFQjCNGSF2DNQHhwDTHz4ogVcLVhM5TiDQ
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDMQFjAI&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:IJvKbJ_a540J:http://www.biography.com/people/john-smith-9486928%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDYQIDAI&usg=AFQjCNHnW1ezRcv8sn_Jk3GBvECp-QOCTg
/search?ie=UTF-8&q=related:www.biography.com/people/john-smith-9486928+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDcQHzAI
/url?q=http://johnsmithjohnsmith.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDkQFjAJ&usg=AFQjCNH9a_jF2woyDESMRrLneIIbbTeS4g
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:_KyTfWhQuFEJ:http://johnsmithjohnsmith.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDwQIDAJ&usg=AFQjCNGX37w0NUcEFa0t04-28gLhlMVfdA
/search?ie=UTF-8&q=related:johnsmithjohnsmith.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD0QHzAJ
/url?q=http://www.johnsmith.co.uk/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD8QFjAK&usg=AFQjCNHEhG7WRm1dP5c_0xqqH0P0U-9jUA
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:jPrP5TbGXhYJ:http://www.johnsmith.co.uk/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEIQIDAK&usg=AFQjCNFe-QSMSKMs8Z6mSu-oLraaeKYAug
/search?ie=UTF-8&q=related:www.johnsmith.co.uk/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEMQHzAK
/url?q=http://www.johnsmith.co.uk/uel&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEUQ0gIoADAK&usg=AFQjCNEk2GkTaQvtpqaaYdztlWV7iVs0Jg
/url?q=http://www.johnsmith.co.uk/bedfordshire&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEYQ0gIoATAK&usg=AFQjCNFcOIItpAW46XRn1BwGvuG7mertRA
/url?q=http://www.johnsmith.co.uk/aru&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEcQ0gIoAjAK&usg=AFQjCNFq68oEVG7KAAu-Mbd0ScBFOMF4MA
/url?q=http://www.history.com/topics/john-smith&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEkQFjAL&usg=AFQjCNGytp4P2oI3szUVSzJbJ1YdOWDldw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:5hQtC90uVmYJ:http://www.history.com/topics/john-smith%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEwQIDAL&usg=AFQjCNERGtQrhvZLOovq8W-Mp8AXeT_W1g
/search?ie=UTF-8&q=related:www.history.com/topics/john-smith+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE0QHzAL
/url?q=http://johnsmithmusic.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE8QFjAM&usg=AFQjCNFlpAC8HDml6r5DpmAo4VviZ_GeMw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:-T7dO31PjlkJ:http://johnsmithmusic.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFIQIDAM&usg=AFQjCNFFeePBNGGMWPaVS9j4_niZpMVyxA
/search?ie=UTF-8&q=related:johnsmithmusic.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFMQHzAM
/url?q=http://www.nps.gov/jame/historyculture/life-of-john-smith.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFUQFjAN&usg=AFQjCNHPmqp05pAUp2yk1R9aKPqohTmWpQ
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:Q_nfCPRpnwQJ:http://www.nps.gov/jame/historyculture/life-of-john-smith.htm%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFgQIDAN&usg=AFQjCNHad3eFxSDuthM23n4FcusD5rY1uw
/search?ie=UTF-8&q=related:www.nps.gov/jame/historyculture/life-of-john-smith.htm+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFkQHzAN
/url?q=http://www.enchantedlearning.com/explorers/page/s/smith.shtml&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFsQFjAO&usg=AFQjCNEWo4pji9pBq89XmlprWg2okGHl5g
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:zs0buZvw9N8J:http://www.enchantedlearning.com/explorers/page/s/smith.shtml%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF4QIDAO&usg=AFQjCNEu0cbayJymDVJ4IfbRc_NtrEtaPA
/search?ie=UTF-8&q=related:www.enchantedlearning.com/explorers/page/s/smith.shtml+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF8QHzAO
/search?ie=UTF-8&q=john+smith+texture+pack&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGIQ1QIoAA
/search?ie=UTF-8&q=john+smith+and+pocahontas&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGMQ1QIoAQ
/search?ie=UTF-8&q=john+smith+actor&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGQQ1QIoAg
/search?ie=UTF-8&q=john+smith+realty&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGUQ1QIoAw
/search?ie=UTF-8&q=john+smith+doctor+who&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGYQ1QIoBA
/search?ie=UTF-8&q=captain+john+smith&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGcQ1QIoBQ
/search?ie=UTF-8&q=john+smith+wrestler&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGgQ1QIoBg
/search?ie=UTF-8&q=john+smith+wrestling&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGkQ1QIoBw
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=20&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=30&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=40&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=50&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=60&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=70&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=80&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=90&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N
/advanced_search?q=John+Smith&ie=UTF-8&prmd=ivnsp
/support/websearch/bin/answer.py?answer=134479&hl=en
/tools/feedback/survey/html?productId=196&query=John+Smith&hl=en
/
/intl/en/ads
/services
/intl/en/policies/
/intl/en/about.html
array(0) { }

1 answer:

Answer 0 (score: 1)

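The problem with your code is that $matches is overwritten with a new array on every call to preg_match_all. After the loop it only holds the result for the last link on the page, and that link does not match the pattern.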
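The overwrite can be reproduced in isolation. The sketch below replaces the `$html->find('a')` result with a plain array, `$hrefs`, which is an assumption for illustration; its three entries are copied from the listing in the question.

```php
<?php
// Hypothetical stand-in for the hrefs returned by $html->find('a').
$hrefs = array(
    '/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw',
    '/url?q=http://johnsmithjohnsmith.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDkQFjAJ&usg=AFQjCNH9a_jF2woyDESMRrLneIIbbTeS4g',
    '/intl/en/about.html', // last link on the page, contains no "url?q="
);

foreach ($hrefs as $href) {
    // $matches is passed by reference and rebuilt on every call,
    // so earlier results are discarded.
    preg_match_all("/url\?q=(.*?)&sa=U&ei=/", $href, $matches);
}

// Only the result for the last href survives, and that href did not match.
var_dump(count($matches[1])); // int(0)
```

The first two hrefs do match during the loop; their captures are simply gone by the time the loop ends.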

Possible solution:

$result = array();
foreach ($html->find('a') as $element) {
    // preg_match returns 1 on a match and 0 otherwise; $matches is reset on every call
    if (preg_match("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches)) {
        $result[] = $matches[1]; // keep the captured URL before the next call overwrites it
    }
}
print_r($result); // print all collected links
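The pattern above depends on the exact `&sa=U&ei=` suffix. As a variation that is not part of the original answer, the target can also be read from the `q` query parameter with PHP's built-in `parse_url` and `parse_str`. This is only a sketch: the helper name `extractTarget` and the sample `$hrefs` array are assumptions for illustration, and the entries are copied from the listing in the question.

```php
<?php
// Hypothetical helper: returns the target of a Google redirect link,
// or null when the href is not of the form /url?q=...
function extractTarget($href)
{
    $parts = parse_url($href);
    if (!isset($parts['path'], $parts['query']) || $parts['path'] !== '/url') {
        return null;
    }
    parse_str($parts['query'], $params); // also URL-decodes the values
    return isset($params['q']) ? $params['q'] : null;
}

// Hypothetical stand-in for the hrefs returned by $html->find('a').
$hrefs = array(
    '/url?q=http://www.johnsmith.co.uk/uel&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEUQ0gIoADAK&usg=AFQjCNEk2GkTaQvtpqaaYdztlWV7iVs0Jg',
    '/advanced_search?q=John+Smith&ie=UTF-8&prmd=ivnsp',
    '/intl/en/about.html',
);

$result = array();
foreach ($hrefs as $href) {
    $target = extractTarget($href);
    if ($target !== null) {
        $result[] = $target; // accumulate instead of overwriting
    }
}
print_r($result);
```

Because `parse_str` decodes percent-escapes, a fragment such as `%23New_England` in an href comes back as `#New_England`, which may or may not be what you want.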

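Another approach, of course, is to look for some kind of marker in the generated HTML page. For example, you could convert the HTML document to XML and then analyze it. The problem with this approach, however, is that Google can change its page layout at any time, in which case you would need to rewrite your algorithm.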
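A minimal sketch of that markup-based idea, assuming PHP's bundled DOM extension rather than the HTML parser used in the question. The `$html` fragment is a made-up stand-in for a downloaded results page, and the XPath expression assumes that result links still start with `/url?q=`, which Google may change at any time.

```php
<?php
// Made-up fragment standing in for the downloaded results page.
$html = '<html><body>'
      . '<a href="/webhp?hl=en">Home</a>'
      . '<a href="/url?q=http://www.history.com/topics/john-smith&amp;sa=U">Result</a>'
      . '<a href="/intl/en/about.html">About</a>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = array();
// Select only anchors whose href starts with the redirect prefix.
foreach ($xpath->query('//a[starts-with(@href, "/url?q=")]') as $a) {
    $links[] = $a->getAttribute('href');
}
print_r($links);
```

The selection logic lives in a single XPath string, so a layout change on Google's side means updating that expression rather than the whole loop.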