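I am learning how to scrape data from websites, but I'm stuck. Because of privacy concerns I can't post the link here, but I'll try to explain.
Rating for hotel 1: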
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
Rating for hotel 2:
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
Rating for hotel 3:
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
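There are 100 hotels like this, each with different classes, so I can't use XPath, or perhaps I just don't understand it well enough.
I want to scrape all the ratings, i.e. "3.5", "3.9", "4.2", for the restaurants, but the problem is that each rating has a different class and a different id.
I'm just a beginner and want to learn, so could someone tell me how to scrape these hotel ratings? An example would be great.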
Answer 0: (score: 1)
Use the lxml library. This returns a list of all the divs that contain the ratings.
from urllib.request import urlopen  # Python 3 replacement for urllib2
from lxml import etree

html = urlopen(url)  # url is the address of the page to scrape
html_text = etree.HTML(html.read())
rating_list = html_text.xpath('//*[@class="right"]/div')
# rating_list = html_text.xpath('//*[@class="right"]')  # choose accordingly; I don't have the full source, so this is commented out
for rate in rating_list:
    print(rate.xpath('text()'))
# Self-contained version that parses the sample HTML from the question
from urllib.request import urlopen  # only needed when fetching from a URL
from lxml import etree

data = """
<div>
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
</div>
</div>
"""

# html = urlopen(url)  # use these two lines if getting the source from a URL
# html_text = etree.HTML(html.read())
html_text = etree.HTML(data)
rating_list = html_text.xpath('//*[@class="right"]/div')
for rate in rating_list:
    # text() returns a list of text nodes; take the first and trim whitespace
    print(rate.xpath('text()')[0].strip('\n\t '))
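Every rating div in the question carries the class token rating-div, so it may be more robust to match on that token than on the parent's class="right", which could also appear elsewhere on the real page. Below is a minimal sketch of that idea. It assumes the markup looks like the sample in the question, and the helper name extract_ratings is made up for illustration.

```python
from lxml import etree


def extract_ratings(html_source):
    """Return the ratings found in html_source as a list of floats."""
    tree = etree.HTML(html_source)
    # Pad the class attribute with spaces so "rating-div" only matches
    # as a whole class token, not as part of "rating-for-305281".
    divs = tree.xpath(
        '//div[contains(concat(" ", normalize-space(@class), " "), " rating-div ")]'
    )
    return [float("".join(div.itertext()).strip()) for div in divs]


# Shortened sample modelled on the markup in the question (an assumption
# about what the real page looks like).
sample = """
<div>
  <div class="right">
    <div data-res-id="305281" class="tooltip rating-for-305281 rating-div left level-6">
      3.5
    </div>
  </div>
  <div class="right">
    <div data-res-id="8913" class="tooltip rating-for-8913 rating-div left level-7">
      3.9
    </div>
  </div>
</div>
"""

print(extract_ratings(sample))
```

Converting to float makes the values easy to sort or average later. If some entries on the real page have no numeric rating, float() would raise ValueError and would need handling.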
Answer 1: (score: 1)
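You should use an HTML parser. There are several options, but BeautifulSoup is the easiest to use and understand. Here is an example that gets the text of the div elements with the class rating-div: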
from bs4 import BeautifulSoup
data = """
<div>
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid a warning
print([r.get_text(strip=True) for r in soup.find_all('div', attrs={'class': 'rating-div'})])
Prints:
['3.5', '3.9', '4.2']
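If you also want to know which hotel each rating belongs to, one option is to read the data-res-id attribute along with the text. The following is only a sketch under the assumption that every rating div has that attribute, as in the question's sample. The function name ratings_by_id is hypothetical.

```python
from bs4 import BeautifulSoup


def ratings_by_id(html_source):
    """Map each data-res-id value to its rating as a float."""
    soup = BeautifulSoup(html_source, "html.parser")
    result = {}
    # The CSS selector "div.rating-div" matches any div whose class list
    # contains rating-div, whatever other classes it has.
    for div in soup.select("div.rating-div"):
        result[div["data-res-id"]] = float(div.get_text(strip=True))
    return result


# Shortened sample modelled on the markup in the question.
sample = """
<div>
  <div class="right">
    <div data-res-id="305281" class="tooltip rating-for-305281 rating-div left level-6">
      3.5
    </div>
  </div>
  <div class="right">
    <div data-res-id="4959" class="tooltip rating-for-4959 rating-div left level-8">
      4.2
    </div>
  </div>
</div>
"""

print(ratings_by_id(sample))
```

The keys are kept as strings because that is how the attribute appears in the HTML.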