Scraping with XPath and DOMDocument

Time: 2014-07-25 18:08:11

Tags: python html xpath web-scraping html-parsing

I am learning how to scrape data from websites, but I am stuck. Because of privacy concerns I can't post the link here, but I will try to explain.

Rating for hotel 1:

<div class = "right">
    <div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
                                           3.5
                 </div>

Rating for hotel 2:

<div class = "right">
    <div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
                                           3.9
                 </div>

Rating for hotel 3:

<div class = "right">
    <div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
                                           4.2
                 </div>

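There are 100 hotels like this, each with different classes, so I can't use XPath, or maybe I just don't understand it well enough.

I want to scrape all the ratings, i.e. "3.5", "3.9" and "4.2", for the restaurants, but the problem is that each rating has a different class and a different id.

I am only a beginner and I want to learn, so could someone tell me how to scrape the ratings for these hotels? An example would be great.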

2 answers:

Answer 0: (score: 1)

Using lxml

This returns a list of all the divs that contain ratings.

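from urllib.request import urlopen
from lxml import etree

url = 'http://example.com/listing'  # placeholder: the real address was not shared in the question

html = urlopen(url)
tree = etree.HTML(html.read())
rating_list = tree.xpath('//*[@class="right"]/div')
# rating_list = tree.xpath('//*[@class="right"]')  # choose accordingly; I don't have the full source, so this is commented out

for rate in rating_list:
    # text() returns a list of the text nodes directly inside each div
    print(rate.xpath('text()'))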

Code for the given sample data

from lxml import etree

data = """
<div>
    <div class = "right">
        <div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
                                               3.5
                     </div>
    </div>
    <div class = "right">
        <div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
                                               3.9
                     </div>
    </div>
    <div class = "right">
        <div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
                                               4.2
                     </div>
    </div>
</div>
"""

# To read from a live page instead, fetch and parse it like this:
# from urllib.request import urlopen
# tree = etree.HTML(urlopen(url).read())

tree = etree.HTML(data)
rating_list = tree.xpath('//*[@class="right"]/div')

for rate in rating_list:
    # Take the first text node and trim the surrounding whitespace
    print(rate.xpath('text()')[0].strip())
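
A possible refinement, offered only as a sketch: the expression `//*[@class="right"]/div` depends on the wrapper div, and `@class="right"` matches only when the attribute is exactly `right`. Every rating div in the sample shares the `rating-div` class token, so the XPath below matches on that token instead and keeps the `data-res-id` so each rating can be tied back to its listing. The `ratings` dict and the tidied-up markup are illustrative assumptions; the real page may be structured differently.

```python
from lxml import etree

# Sample markup adapted from the question (whitespace tidied for readability).
data = """
<div>
    <div class="right">
        <div data-res-id="305281" class="tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
            3.5
        </div>
    </div>
    <div class="right">
        <div data-res-id="8913" class="tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
            3.9
        </div>
    </div>
    <div class="right">
        <div data-res-id="4959" class="tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
            4.2
        </div>
    </div>
</div>
"""

tree = etree.HTML(data)

# Match any div whose space-separated class list contains the token "rating-div".
# Padding both sides with spaces avoids partial matches such as "rating-div-large".
token_xpath = '//div[contains(concat(" ", normalize-space(@class), " "), " rating-div ")]'

# Map each data-res-id to its rating as a float (hypothetical structure, for illustration).
ratings = {}
for div in tree.xpath(token_xpath):
    ratings[div.get('data-res-id')] = float(div.text.strip())

print(ratings)
```

Because the match is on the shared token, the per-hotel classes (`rating-for-305281`, `level-6` and so on) no longer matter.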

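Answer 1: (score: 1)

You should use an HTML parser. There are several options, but BeautifulSoup is the easiest to use and understand. Here is an example that gets the text of the div elements that have the rating-div class: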

from bs4 import BeautifulSoup

data = """
<div>
    <div class = "right">
        <div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
                                               3.5
                     </div>
    </div>
    <div class = "right">
        <div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
                                               3.9
                     </div>
    </div>
    <div class = "right">
        <div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
                                               4.2
                     </div>
    </div>
</div>
"""

soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid a warning
print([r.get_text(strip=True) for r in soup.find_all('div', attrs={'class': 'rating-div'})])

Prints:

['3.5', '3.9', '4.2']
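
If the ratings are needed as numbers rather than strings, the same idea can be taken one step further. The following is a minimal sketch, assuming the markup looks like the sample in the question: it uses BeautifulSoup's CSS selector support (`select`) to find the rating divs, keys each rating by its `data-res-id`, and computes an average. The names `rating_by_id` and `average` are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question (whitespace tidied for readability).
data = """
<div>
    <div class="right">
        <div data-res-id="305281" class="tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
            3.5
        </div>
    </div>
    <div class="right">
        <div data-res-id="8913" class="tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
            3.9
        </div>
    </div>
    <div class="right">
        <div data-res-id="4959" class="tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
            4.2
        </div>
    </div>
</div>
"""

soup = BeautifulSoup(data, 'html.parser')

# "div.rating-div" matches every div whose class list includes rating-div,
# whatever other hotel-specific classes it carries.
rating_by_id = {
    div['data-res-id']: float(div.get_text(strip=True))
    for div in soup.select('div.rating-div')
}

# Average of the scraped ratings.
average = sum(rating_by_id.values()) / len(rating_by_id)

print(rating_by_id)
print(round(average, 2))
```

Converting with `float` fails loudly on a listing that has no numeric rating, which may or may not be what you want on the real page; wrapping the conversion in a `try`/`except ValueError` is one way to skip such entries.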