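I want to put my Google Chrome bookmarks into a database. My first step is to export them from Chrome as an .html file and get the data into variables. I'm hoping for some PHP code that can run over the data below and pull the URL, ADD_DATE, ICON, and link text into their own variables.
I assume I need some regular expressions; can anyone help? Thanks. I'll add a bounty to this when time permits.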
<A HREF="http://snipt.net/public/tag/css"
ADD_DATE="1271801059"
ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACtklEQVQ4jXWSS2gTURSGvzszyaSxpsS2vhe2WosgilgVHyDqzo2iIoog+EIKCiIuFNTGjUoVBLWCiKArFcSFi7hQFLT4Qqp10SK11mKbgk3SmjSdJDNzj4s+0Fb/zTmL/3z8596jmKDElxcVYTuwxS3+Gu7O9DysqzvsTvT8KfVnP9DdvBfRZ3w3N197DqGAepV2AyePPuj9FDKNGUZBG68/dzo/Hjcm/gL0dcQrS4KRO9pzNvt+EdvUDOVdWr6lSKSdYUeFr39NhuNdP7N2KvNrZti21brF856eO7AloQAGul40iHgx3ysQsoNXP3Znih/avp6YX2lSXWESDRvprFe2fNHqfd8BdsduViQzxQ19mcxLAwAxporWKKXwXIyQJWxdMZu1i2YTjUTxsKeV2dlLsVjMALgXO5yMRqYMhE1zpjW6SBalQBSuXziyoNzC9UPk3QJaRsFa7QjOil5YWX/15Yqa6VYinc3m0vl2C0BEJxUKQQCh6Gu074MIIoIWjWhh55LipkiopDGpnVzT8UN5AGskgDRjmL74YooWEI2IIGhAA4IWQWD55prc1uo1R26P/YIBEK3e2KoM+5HCGB8ADTJSR2CC1oInXqz92anyvwAAnngNygrmRDQylmC8CogQDviIl5v7NrXg9CRAxbz17UpZTUqZiOjRNUYAQVMzNeDQ0muyL76Jg893Hdt+Y2jJ+BuMqeANXw5YJXs8d2iOiGAqTant0tVf5Mr7Wu53rsOX6ZSEvZ62nqyeeMoAJDuf1nvO4A2bQTLOMHdbolxrXUV/fiGEKFRFBm5VlfZffH66tvefgI6OuF0u7pt4a2pZ47vFfE4thWCQytLck9qy/nPNZ6veTZyZpPP3m7cF6n8K+0VKjxba6xp6d/3POynBmJaed07afs4s+tmmT7Gqwf/5fgMaeWl1u/QPfAAAAABJRU5ErkJggg=="
>Snipt - public - css | Share and store code or command snippets.</A>
I like user yc's suggestion to use something like this instead of regex:
$s = '<A HREF="http://snipt.net/public/tag/css"
ADD_DATE="1271801059"
ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACtklEQVQ4jXWSS2gTURSGvzszyaSxpsS2vhe2WosgilgVHyDqzo2iIoog+EIKCiIuFNTGjUoVBLWCiKArFcSFi7hQFLT4Qqp10SK11mKbgk3SmjSdJDNzj4s+0Fb/zTmL/3z8596jmKDElxcVYTuwxS3+Gu7O9DysqzvsTvT8KfVnP9DdvBfRZ3w3N197DqGAepV2AyePPuj9FDKNGUZBG68/dzo/Hjcm/gL0dcQrS4KRO9pzNvt+EdvUDOVdWr6lSKSdYUeFr39NhuNdP7N2KvNrZti21brF856eO7AloQAGul40iHgx3ysQsoNXP3Znih/avp6YX2lSXWESDRvprFe2fNHqfd8BdsduViQzxQ19mcxLAwAxporWKKXwXIyQJWxdMZu1i2YTjUTxsKeV2dlLsVjMALgXO5yMRqYMhE1zpjW6SBalQBSuXziyoNzC9UPk3QJaRsFa7QjOil5YWX/15Yqa6VYinc3m0vl2C0BEJxUKQQCh6Gu074MIIoIWjWhh55LipkiopDGpnVzT8UN5AGskgDRjmL74YooWEI2IIGhAA4IWQWD55prc1uo1R26P/YIBEK3e2KoM+5HCGB8ADTJSR2CC1oInXqz92anyvwAAnngNygrmRDQylmC8CogQDviIl5v7NrXg9CRAxbz17UpZTUqZiOjRNUYAQVMzNeDQ0muyL76Jg893Hdt+Y2jJ+BuMqeANXw5YJXs8d2iOiGAqTant0tVf5Mr7Wu53rsOX6ZSEvZ62nqyeeMoAJDuf1nvO4A2bQTLOMHdbolxrXUV/fiGEKFRFBm5VlfZffH66tvefgI6OuF0u7pt4a2pZ47vFfE4thWCQytLck9qy/nPNZ6veTZyZpPP3m7cF6n8K+0VKjxba6xp6d/3POynBmJaed07afs4s+tmmT7Gqwf/5fgMaeWl1u/QPfAAAAABJRU5ErkJggg=="
>Snipt - public - css | Share and store code or command snippets.</A>';
$bookmarks = simplexml_load_string($s); // parse the single <A> element held in $s
echo $bookmarks["HREF"]; //URL
echo '<br>';
echo $bookmarks[0]; //Name
echo '<br>';
echo $bookmarks['ICON']; //Icon
echo '<br>';
echo $bookmarks['ADD_DATE']; //Add_Date
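But I haven't figured out how to make it work on an HTML page or string that contains multiple links.
Then I found PHP's DOMDocument class, and I seem to have it working like this: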
$html = '<DT><A HREF="http://stackapps.com/questions/518/stacktack-a-javascript-widget-you-can-stick-anywhere" ADD_DATE="1301274664" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACY0lEQVQ4jX2SS0jVQRTGv//MnJnxGjctpbCFIrgO2rRr06KiRdtKEYLwUj4gohdBZlFUEmmp0N8WIZXrIMiF25Zu27hQIaKiuHaze/93ni3ykY/bWR3O/M43zDdfghp14fpEo67HZwCx+ONby8uR28s7cayWQH0DP6G01lrruqb9LcdrcclaMzA0doqIHnESrxQnmfBkgHGeTwD4EEoxhKfWWuOc7/LGXH0y2Pd2XaCn5znl2zBPUraSJAghwAUHYxwAEGOAcw7eeVhjYapmqbQYO9K0YBkApGnBMqJJIoKSEkorKK2hlCwpJUtKa2itIZUEEUGQmEzTgt3kAZHgggSEJBBRJjjvz4Jpz4Jp5yT6BVG2ugxOgq97cO7G/ea9u5uO5nK5CaVVo9IanLG+S10nx/81a3R6ptd7N5ZVMmSVbLlcznpXvhdnWUN9wxIJMc0Za2SMgyVJyVTd1Fa3s3J1KkmSEmMcgvMGKfjrXOOuxe3fmGybbD9PNjhWWf7dZl3odCEshxAQI/JSqe6tezqnumNEPvgA50PRene2uFJsXb/v5sjULaXkkNYagihjgl8pV8vTAJBTuTPBh2Fnra5mGYyxg3cHuu4AgFgT8NZ5zzmssYiAFjE+qxPqHgAE5/PeOVhj4Z2DrTq/6cV/g5TMSyVbSQoIQRtBSoAY1oLkVoNkl34uhI40LVgOAHNz78LhI8cWInAohjCKGD947w+GELT3DtbYkvN+2JjqrLXugDfm8vjjix//6/m1By9OM+LTABCs73x4/fybnTix0xAAvn75NLOneV8FQFL5Fd/X4v4ArZQWGyLoDDcAAAAASUVORK5CYII=">app - StackTack, a JavaScript widget you can stick anywhere - Stack Apps</A>
<DT><A HREF="https://chrome.google.com/extensions/detail/paoeolblihedcagbofkkkecjilmpehmo" ADD_DATE="1301275461" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADSklEQVQ4jW2TX2gbBQDGf5e7NGmTlOZ/xKVtbOfWrjWUNa6dOB9ay4aMqvgwEPRFRYTBGEPwwRcnfVB8ERQn+CBS6XSoVAWZImMWOttJtcZ2bdpm69qkjfnTXq7J3SV358uEyvzePvh9fC/fJ/A/OptIPPH6yMjzmq4PIgh2h9O5tZTPf33uypWJZC63vZ8V9ptwRzj0WvvhCy9EDrxUmJnxWsUilmUhOJ1IPp9VdjqTP+Ryb4+tr39xX+vQqWPdE9Pjy99cOG9lgiHrjiham0e6rc1HE9ZGLGYtgbUA1prTab3n8bzxb04ECAaD7g8n35z0NAd62FA5fPsuW0+d4nqsk6Q/TPlQB13xOL70bWyKwqDbPeR7su3Pq0v5RRFg9NzJs9E+x4t2a49SYxMlVxsflyDde4Z08DG+N0PoITcBu4F5J0tDRSMx8HBivLjzmRgOh13D50ferRibDzrRiQRb+fTaCguto4Tb++jv9qO5Olis5BG9axSG27jhCrArNHkdsZ1ZKXikNS43GO3ynkLE4aRumeQNGaVSIyfryB4RRQW5mCXQXufMcxmKp7P8nowSmbGdkMpq+YGNvVKT3aaTrZSJtuzQdTDC4i+XmZNcpIoRasVlfDevEjvmg/IWLUKGwa46yXkrIlk1dK1u1/NmzeW3a7QUFxgZOUEm+zPJ6TF2qh58lswrz/TQfzSDJWdQa40oio2KqqnS1tpmqjOnbZf8dm9KqSAKINqnOf10goOODR55qJGe7iY6W1dg5ztqpoSqSqiawPyt8i3RqOqy5LZ3O45G+zdKOUwRcms1WtLw8rMCfX2r+BzXYfcnTMNA0+wolWbWs5py8VJhzAboqW9/+9y5aeRw+5hf3WPIauXV0QLeph+hMAXKEqYpoGoOdmUPRt3kky/lr0ol/hAB0M2/95bzZufjvQNii6/heKiZ/tANqK6AUcUwLGp1G1W1Gb0mcWmiMPX+ePkikBbvLbKmFspr8uz6brg9FF0NBANoMQ55y0i2EjZTQNcl/kohv/XB7uRHl+V3gFnA/M+ZgCBw3N97YDg8EI93tYX8cVuqKuVuKlNz3P11Tp0pyFwDFoH6fW+8JwmIAFHAA9SAKpAHtoHyfvgfh8p7963YqU4AAAAASUVORK5CYII=">StackStalker - Google Chrome extension gallery</A>
<DT><A HREF="http://stackapps.com/questions/319/phpstack-a-php-wrapper-to-the-se-api" ADD_DATE="1301276371" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACY0lEQVQ4jX2SS0jVQRTGv//MnJnxGjctpbCFIrgO2rRr06KiRdtKEYLwUj4gohdBZlFUEmmp0N8WIZXrIMiF25Zu27hQIaKiuHaze/93ni3ykY/bWR3O/M43zDdfghp14fpEo67HZwCx+ONby8uR28s7cayWQH0DP6G01lrruqb9LcdrcclaMzA0doqIHnESrxQnmfBkgHGeTwD4EEoxhKfWWuOc7/LGXH0y2Pd2XaCn5znl2zBPUraSJAghwAUHYxwAEGOAcw7eeVhjYapmqbQYO9K0YBkApGnBMqJJIoKSEkorKK2hlCwpJUtKa2itIZUEEUGQmEzTgt3kAZHgggSEJBBRJjjvz4Jpz4Jp5yT6BVG2ugxOgq97cO7G/ea9u5uO5nK5CaVVo9IanLG+S10nx/81a3R6ptd7N5ZVMmSVbLlcznpXvhdnWUN9wxIJMc0Za2SMgyVJyVTd1Fa3s3J1KkmSEmMcgvMGKfjrXOOuxe3fmGybbD9PNjhWWf7dZl3odCEshxAQI/JSqe6tezqnumNEPvgA50PRene2uFJsXb/v5sjULaXkkNYagihjgl8pV8vTAJBTuTPBh2Fnra5mGYyxg3cHuu4AgFgT8NZ5zzmssYiAFjE+qxPqHgAE5/PeOVhj4Z2DrTq/6cV/g5TMSyVbSQoIQRtBSoAY1oLkVoNkl34uhI40LVgOAHNz78LhI8cWInAohjCKGD947w+GELT3DtbYkvN+2JjqrLXugDfm8vjjix//6/m1By9OM+LTABCs73x4/fybnTix0xAAvn75NLOneV8FQFL5Fd/X4v4ArZQWGyLoDDcAAAAASUVORK5CYII=">library - PHPstack - A PHP wrapper to the SE API - Stack Apps</A>
';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    // loadHTML() lowercases attribute names, so HREF is read as href, and so on
    echo 'Title = ' . $node->nodeValue . '<br>';
    echo 'URL = ' . $node->getAttribute('href') . '<br>';
    echo 'Icon = ' . $node->getAttribute('icon') . '<br>';
    echo 'Date Added = ' . $node->getAttribute('add_date') . '<br>';
    echo '<br>';
}
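Since the end goal is a database, it may be more convenient to collect the fields into an array instead of echoing them. The following is a minimal sketch, not a definitive implementation: the function name extractBookmarks, the array keys, and the example.com sample data are all made up for illustration, and it assumes the export can be parsed by DOMDocument::loadHTML as in the loop above.

```php
<?php
// Hypothetical helper: turn a Chrome bookmarks export into an array of rows.
function extractBookmarks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        $rows[] = [
            'title'    => trim($node->nodeValue),
            'url'      => $node->getAttribute('href'),
            'icon'     => $node->getAttribute('icon'),      // '' when the attribute is missing
            'add_date' => (int) $node->getAttribute('add_date'),
        ];
    }
    return $rows;
}

// Made-up sample data in the same shape as the export above.
$sample = '<DT><A HREF="http://example.com/a" ADD_DATE="1301274664" ICON="data:image/png;base64,AAAA">First</A>'
        . "\n" . '<DT><A HREF="http://example.com/b" ADD_DATE="1301275461">Second</A>';

$bookmarks = extractBookmarks($sample);
print_r($bookmarks);
```

Each row could then be bound to a prepared INSERT statement. ADD_DATE looks like a Unix timestamp, which is why the sketch casts it to an integer; that interpretation is an assumption.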
Answer 0 (score: 1)
Don't use regex: even the HTML that Chrome produces is not a regular language.
Use an XML parser instead, such as SimpleXML. If the string above is $s:
$bookmarks = simplexml_load_string($s);
echo $bookmarks["HREF"]; //URL
echo $bookmarks[0]; //Name
object(SimpleXMLElement)#1 (2) {
  ["@attributes"]=>
  array(3) {
    ["HREF"]=>
    string(31) "http://snipt.net/public/tag/css"
    ["ADD_DATE"]=>
    string(10) "1271801059"
    ["ICON"]=>
    string(1026) "data:image/png;base64,iVBh....="
  }
  [0]=>
  string(64) "Snipt - public - css | Share and store code or command snippets."
}
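To handle more than one link with SimpleXML, one option is to wrap the links in a single root element and iterate over its children. This is only a sketch under stated assumptions: every <A> element must be well-formed XML (quoted attributes, a closing tag, no bare "&" in URLs), and the unclosed <DT> tags of a real Chrome export would have to be removed first. The wrapper name "bookmarks" and the sample data are made up.

```php
<?php
// Made-up sample: two well-formed <A> elements with no surrounding <DT> tags.
$links = '<A HREF="http://example.com/a" ADD_DATE="1271801059" ICON="data:image/png;base64,AAAA">First link</A>'
       . '<A HREF="http://example.com/b" ADD_DATE="1301274664" ICON="data:image/png;base64,BBBB">Second link</A>';

// XML needs exactly one root element, so wrap the links in an arbitrary one.
$xml = simplexml_load_string('<bookmarks>' . $links . '</bookmarks>');
if ($xml === false) {
    exit("Input is not well-formed XML\n");
}

$parsed = [];
foreach ($xml->A as $a) {           // XML is case-sensitive: A, HREF, etc. stay uppercase
    $parsed[] = [
        'url'      => (string) $a['HREF'],
        'add_date' => (string) $a['ADD_DATE'],
        'icon'     => (string) $a['ICON'],
        'name'     => (string) $a,  // text content of the element
    ];
}
print_r($parsed);
```

Because real exports are HTML rather than strict XML, an HTML parser such as DOMDocument is likely to be more forgiving on a full bookmarks file.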
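Answer 1 (score: 1)
In general, this tutorial on extracting HTML data with PHP may help you:
XPath is definitely worth mastering if you have to work with HTML or XML in general. W3Schools has a good reference: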
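As a rough illustration of XPath applied to the bookmarks problem, the sketch below uses PHP's DOMXPath on top of DOMDocument. The sample data and the timestamp threshold are made up, and the lowercase names in the queries assume the HTML parser has lowercased the tag and attribute names.

```php
<?php
// Made-up sample in the shape of a Chrome export; the last link has no HREF.
$html = '<DT><A HREF="http://example.com/a" ADD_DATE="1301274664">First</A>'
      . "\n" . '<DT><A HREF="http://example.com/b" ADD_DATE="1301275461">Second</A>'
      . "\n" . '<DT><A ADD_DATE="1301276371">No URL</A>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select only the anchors that actually carry an href attribute.
$urls = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $urls[] = $a->getAttribute('href');
}

// Select anchors whose add_date is numerically greater than a threshold.
$recent = $xpath->query('//a[@add_date > 1301275000]');

print_r($urls);
echo $recent->length, "\n";
```

The predicate in square brackets is what XPath adds over getElementsByTagName: the filtering happens in the query rather than in PHP code.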
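Answer 2 (score: 1)
Another option (leaving PHP behind) is to use jQuery and CSS selectors. I prefer CSS selectors to XPath for most purposes, and this approach lets you take advantage of the wonderful SelectorGadget tool.
Here is a recent guide: http://blog.dtrejo.com/scraping-made-easy-with-jquery-and-selectorga
Note: they link to the original jQuerify. There is an actively maintained jQuerify Chrome extension and a newer, better jQuerify.
SelectorGadget is demonstrated in this screencast, starting at about 5:35.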