Printing every occurrence of certain document elements from a web page

Date: 2014-10-01 12:06:41

Tags: python web-scraping beautifulsoup

So I am scraping this particular page, https://www.zomato.com/srijata, for all the "restaurant reviews" posted by "Sri" (not the comments she left on her own reviews).

import urllib2
from bs4 import BeautifulSoup

zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)   # save the page to disk
soup = BeautifulSoup(open('zomato_info.html'))     # re-read the saved copy into a soup
soup.find('div', 'mtop0 rev-text').text            # first review block on the page

This prints her first restaurant review, i.e. "Sri reviewed Big Straw - Chew On This", as:

 u'Rated  This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..'

I also tried another selector:

The question I have is:

How can I print the next restaurant review? I tried findNextSiblings and similar methods, but none of them seemed to work.

2 Answers:

Answer 0 (score: 1)

First of all, you don't need to write the output to a file; just pass the result of the urlopen() call to the BeautifulSoup constructor.

To get the reviews, iterate over all div tags with class rev-text and take the .next_sibling of the inner div element:

import urllib2
from bs4 import BeautifulSoup

# Parse the response directly; no intermediate file needed.
soup = BeautifulSoup(urllib2.urlopen('https://www.zomato.com/srijata'))

# Each review sits in a div with class "rev-text"; the review text is the
# node that follows the inner (rating) div.
for div in soup.find_all('div', class_='rev-text'):
    print div.div.next_sibling

This prints:

This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..

The ambience is good. The food quality is good. I Didn't find anything to complain. I wanted to visit the place fir a very long time and had dinner today. The meals are very good and if u want the better quality compared to other Andhra restaurants then this is the place. It's far better than nandhana. The staffs are very polite too. 

...
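
As a side note, here is a minimal offline sketch of why div.div.next_sibling yields the review text. The markup below is made up to mirror the assumed structure of one rev-text block: an inner div holding the rating, followed by the review itself as a bare text node.

from bs4 import BeautifulSoup

# Hypothetical markup mirroring one "rev-text" review block.
html = '''
<div class="rev-text">
  <div class="rating">Rated 4.0</div>
  This is situated right in the heart of the city. ...
</div>
'''

block = BeautifulSoup(html).find('div', class_='rev-text')
print block.div.next_sibling.strip()   # the text node right after the rating div

If the review text turned out to be wrapped in its own tag instead of a bare text node, you would read .text from that sibling rather than stripping a string.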

Answer 1 (score: 0)

You should use a for loop and find_all instead of find:

import urllib2
from bs4 import BeautifulSoup

zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)   # save the page to disk
soup = BeautifulSoup(open('zomato_info.html'))     # parse the saved copy
for div in soup.find_all('div', 'rev-text'):       # every review block, not just the first
    print div.text

One more question: why do you save the HTML to a file and then read that file back into the soup object?
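
If keeping a local copy is actually the goal (say, to avoid re-downloading the page while experimenting), a minimal sketch using with blocks so the file handles get closed properly could look like this; otherwise the response can go straight into BeautifulSoup as in the first answer.

import urllib2
from bs4 import BeautifulSoup

# Hypothetical caching variant: download once, then parse the saved copy.
with open('zomato_info.html', 'w') as f:
    f.write(urllib2.urlopen('https://www.zomato.com/srijata').read())

with open('zomato_info.html') as f:
    soup = BeautifulSoup(f)

for div in soup.find_all('div', 'rev-text'):
    print div.text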