Find the number of 'p' tags and iterate over them to scrape the underlying text using PHP

Asked: 2013-04-10 07:43:23

Tags: php curl web-scraping

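I can't figure out how to use PHP to scrape the text of paragraphs from a web page when they have no "id" or "class". One approach is to count and iterate over the <p> tags inside a <div>, but the div itself is closed before any <p> tag is encountered. I plan to scrape wikitravel.org information to learn scraping. Here is a sample of the page source from wikitravel.org: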
   <h2><span class="editsection">[<a href="/wiki/en/index.php?title=Kanniyakumari&amp;action=edit&amp;section=18" title="Edit section: Sleep">edit</a>][<a href="#Sleep" title="click to add a sleep listing" onclick="addListing(this, '18', 'sleep', 'Kanniyakumari');">add listing</a>]</span> <span class="mw-headline" id="Sleep">Sleep</span></h2>

   <p>There are numerous hotels, residencies etc. in and around Kanyakumari and therefore, staying over is not be a problem. But there are agents, touts and brokers in every nook and corner looking for unsuspecting tourists. Eschew buying or booking rooms from them, as many a time you end up paying a lot more than the actual price. Vivekananda Kendra can be a good option for people looking for a decent, yet cheap accommodation, but it's around 3 km from Kanyakumari. Prefer hotels near the beach especially if you want to watch the sunrise right out of your bed! Note that, you should quote this preference when booking the room or else, you'll always be given a room without a window opening out to the sea. Moreover many a times, these rooms are in great demand and you'll find yourself shelling a extra 400 - 500 Rs (~10 US$)for such a room. Hotel Sea View, Hotel Sangam and a couple of other hotels offer such rooms and the rent is about Rs. 1100 (~ 25 US$) for 12 hrs. Note that many rooms are priced for 12 hrs  and not per day especially during the peak season.
</p>

<p>ATM's in Kanyakumari:</p>

 <p>Canara Bank 
 Main Road, Kanyakumari 629702, ,
 </p>
 <p>Indian Bank 
  S No 658 / 1, National High Way Opp St Antony'S Higher Secondary Sckanyakumari 629702
 </p>
<p>State Bank Of Travancore 
P.B.No.1, 1/17 Amman Sannathi Street, Kanyakumari, Tamil Nadu, 629702
</p>

Can anyone help? Thanks in advance!

2 answers:

Answer 0: (score: 0)

I have always found jQuery to be the best way to scrape HTML data. Have PHP render a page that uses jQuery to parse the scraped HTML and then posts a JSON dataset back to PHP.

If you want to stick to the pure PHP route, try this: http://simplehtmldom.sourceforge.net/
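For comparison, counting and iterating the <p> tags can also be sketched with PHP's built-in DOM extension, which is a different tool from the library linked above. This is only a minimal sketch: the inline $html string is a shortened stand-in for the downloaded page (an assumption, so the example runs without network access), and the variable names are illustrative.

```php
<?php
// Shortened stand-in for the downloaded wikitravel page (assumption:
// real code would fetch the HTML first, e.g. with cURL).
$html = <<<HTML
<html><body>
<h2><span class="mw-headline" id="Sleep">Sleep</span></h2>
<p>There are numerous hotels in and around Kanyakumari.</p>
<p>ATM's in Kanyakumari:</p>
<p>Canara Bank
Main Road, Kanyakumari 629702</p>
</body></html>
HTML;

// Real-world pages are rarely valid HTML, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Every <p> element in the document, regardless of id or class.
$paragraphs = $doc->getElementsByTagName('p');
$count = $paragraphs->length;
echo "Found $count <p> tags\n";

// Collect the text of each paragraph, with nested markup stripped.
$texts = [];
foreach ($paragraphs as $p) {
    $texts[] = trim($p->textContent);
}

foreach ($texts as $i => $text) {
    echo "[$i] $text\n";
}
```

Because getElementsByTagName searches the whole document, it does not matter whether the paragraphs sit inside a particular div.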

Answer 1: (score: 0)

Take a look at the simplehtmldom parser. It works with jQuery-like selectors.

Here is an example for your case:

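// Assumes simple_html_dom.php has been downloaded next to this script
include 'simple_html_dom.php';

$html = file_get_html('http://www.wikitravel.com/yourpage');
foreach ($html->find('p') as $element) {
    // innertext keeps nested markup; use $element->plaintext for text only
    echo $element->innertext; // the content of every <p> tag
}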
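If only the paragraphs of one section are wanted, the heading's id from the question's sample (id="Sleep") can serve as an anchor even though the paragraphs themselves have no id or class. The following is a rough sketch using PHP's built-in DOMDocument and DOMXPath rather than simplehtmldom. The inline HTML is a shortened sample, and the "Get out" section in it is made up for illustration to show where collection stops.

```php
<?php
// Shortened sample; the "Get out" section is hypothetical and only
// marks where the "Sleep" section ends.
$html = <<<HTML
<html><body>
<h2><span class="mw-headline" id="Sleep">Sleep</span></h2>
<p>There are numerous hotels in and around Kanyakumari.</p>
<p>ATM's in Kanyakumari:</p>
<h2><span class="mw-headline" id="Get_out">Get out</span></h2>
<p>This paragraph belongs to the next section.</p>
</body></html>
HTML;

libxml_use_internal_errors(true); // tolerate imperfect markup
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Locate the <h2> that contains the span with id="Sleep".
$xpath = new DOMXPath($doc);
$heading = $xpath->query('//h2[span[@id="Sleep"]]')->item(0);

// Walk the heading's following siblings and keep the text of each <p>
// until the next <h2> starts a new section.
$sectionTexts = [];
if ($heading !== null) {
    for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'h2') {
            break;
        }
        if ($node->nodeName === 'p') {
            $sectionTexts[] = trim($node->textContent);
        }
    }
}

foreach ($sectionTexts as $text) {
    echo $text, "\n";
}
```

This assumes the section's paragraphs are direct siblings of the heading, as they are in the sample from the question; a page with different nesting would need a different traversal.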