Scrapy CrawlSpider rules - process "next page" first

Asked: 2015-11-12 19:11:05

Tags: web-scraping web-crawler scrapy

I am trying to scrape the list of all hotels in San Francisco from: http://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html

Each "next page" of hotels has a distinct URL:

Page 2 is: /Hotels-g60713-oa30-San_Francisco_California-Hotels.html

Page 3 is: /Hotels-g60713-oa60-San_Francisco_California-Hotels.html

Page 4 is: /Hotels-g60713-oa90-San_Francisco_California-Hotels.html

And so on...
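The offset in the URL grows by 30 per page, so the pagination links can be recognized with a regular expression. As a small illustration (my own sketch, not part of the original question), a helper that extracts the `oa` offset from these URLs could look like:

```python
import re

# Match TripAdvisor pagination URLs whose offset grows by 30 per page
# (oa30, oa60, oa90, ...). Page 1 has no "oa" segment at all.
PAGINATION_RE = re.compile(
    r'/Hotels-g60713-oa(\d+)-San_Francisco_California-Hotels\.html$')

def page_offset(url):
    """Return the 'oa' offset of a pagination URL, or 0 for page 1."""
    m = PAGINATION_RE.search(url)
    return int(m.group(1)) if m else 0
```

The same pattern (e.g. `allow=r'-oa\d+-'`) could be handed to a link extractor to restrict a rule to pagination links only.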

  1. How do I set up a CrawlSpider to visit these pages?
  2. Are there any rules that could help me in this case?
  3. Is there a way to prioritize them, so that it crawls and parses these pages before anything else?
  4. My code so far:

    import beatSoup_test
    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class TriAdvSpider(CrawlSpider):
        name = "tripAdv"
        allowed_domains = ["tripadvisor.com"]
        start_urls = [
        "http://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html"
        ]
        rules = (
            Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'), callback='parse_item', follow=True),
        )
    
    
        def parse_item(self, response):
            beatSoup_test.getHotels(response.body_as_unicode())
    

where beatSoup_test is my parsing function that uses BeautifulSoup. Thanks!

1 answer:

Answer 0 (score: 1)

If you want to scrape data from any page, use XPath. That way you can scrape anything on the same page.

And use Items to store the scraped data, so you can scrape as many things as you want.
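In a real project you would declare the fields on a `scrapy.Item` subclass with `scrapy.Field()`. Purely to illustrate the idea without depending on Scrapy, an Item behaves much like a dict restricted to its declared fields; a toy stand-in (field names are my guesses, not from the original) might be:

```python
class TriAdvItem(dict):
    """Toy stand-in for a scrapy.Item: a dict that only accepts declared fields."""
    fields = {'title', 'day', 'timings1', 'timings2'}  # hypothetical field names

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('TriAdvItem does not support field: %s' % key)
        super().__setitem__(key, value)

item = TriAdvItem()
item['title'] = ['Hotel A', 'Hotel B']  # assigning a declared field works
```

Assigning an undeclared field (e.g. `item['bogus']`) raises a `KeyError`, which is how Scrapy Items catch typos in field names early.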

Here is an example of how to use it.

    sites = Selector(text=response.body).xpath('//div[contains(@id, "identity")]//section/div/div/h3/a/text()')
    items = myspiderBotItem()
    items['title'] = sites.extract()
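Scrapy's `Selector` wraps lxml; just to illustrate the extraction idea without that dependency, the same h3/a text lookup can be sketched with the stdlib `xml.etree.ElementTree` (which supports only a subset of XPath, so `contains()` is dropped; the HTML snippet here is invented for the example):

```python
import xml.etree.ElementTree as ET

# Minimal, made-up markup mimicking the structure the answer's XPath targets.
html = """<div id="identity-block">
  <section><div><div>
    <h3><a href="/Hotel-1">Hotel Alpha</a></h3>
  </div></div></section>
</div>"""

root = ET.fromstring(html)
# ElementTree's limited XPath: no contains(); match the nested h3/a directly.
titles = [a.text for a in root.findall('.//h3/a')]
```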

Like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader

class TriAdvSpider(CrawlSpider):
    name = "tripAdv"
    allowed_domains = ["tripadvisor.com"]
    start_urls = [
        "http://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # beatSoup_test.getHotels(response.body_as_unicode())
        l = XPathItemLoader(item=TriAdvItem(), response=response)
        for i in range(1, 8):
            # Collect one table row per iteration.
            base = '//*[@id="super-container"]/div/div[1]/div[2]/div[2]/div[1]/table/tbody/tr[' + str(i) + ']'
            l.add_xpath('day', base + '/th[@scope="row"]/text()')
            l.add_xpath('timings1', base + '/td[1]/span[1]/text()')
            l.add_xpath('timings2', base + '/td[1]/span[2]/text()')
        return l.load_item()
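Note that the answer never addresses question 3, prioritizing the listing pages. In Scrapy you can pass `priority=` when constructing a `Request`; the scheduler pops higher-priority requests first, so pagination requests yielded with a high priority are fetched before hotel-detail requests. The scheduling idea itself is just a max-priority queue, sketched here with the stdlib (this is my own illustration, and the detail URL is hypothetical):

```python
import heapq

class RequestQueue:
    """Toy model of Scrapy's scheduler: higher priority pops first."""
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker keeps FIFO order within one priority

    def push(self, url, priority=0):
        # heapq is a min-heap, so negate priority for max-first behaviour.
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = RequestQueue()
q.push('/Hotel_Review-xyz.html', priority=0)  # hypothetical detail page
q.push('/Hotels-g60713-oa30-San_Francisco_California-Hotels.html', priority=10)
```

Even though the detail page was pushed first, the pagination URL comes out first because of its higher priority.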