Using Scrapy's LinkExtractor

Date: 2016-07-13 12:58:39

Tags: python scrapy

I am trying to scrape the site http://www.funda.nl/koop/amsterdam/, which lists houses for sale in Amsterdam, and to extract data from the subpages for the individual houses, such as http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/. As a first step, I want to obtain a list of all those subpages. So far I have the following spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem
from scrapy.shell import inspect_response

class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % start_urls[0], allow_domains='funda.nl')
    rules = (
        Rule(le1, callback='parse_item'),
    )

    def parse_item(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            item = FundaItem()
            item['url'] = link.url
            print("The item is "+str(item))
            yield item
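
(For reference, FundaItem is imported from Funda/items.py, which is not shown. Since the spider only ever assigns a url field, a minimal sketch of what the item presumably looks like:)

import scrapy

class FundaItem(scrapy.Item):
    # Assumed minimal definition: only 'url' is used by the spider above;
    # the real Funda/items.py may declare additional fields.
    url = scrapy.Field()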

If I run this with JSON output as scrapy crawl Funda -o funda.json, the resulting funda.json looks like this (first few lines only):

[
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/ywavcsbywacbcasxcxq.html"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/print/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/reageer/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/bezichtiging/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/brochure/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/doorsturen/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/meld-een-fout/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/ywavcsbywacbcasxcxq.html"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/print/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/reageer/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/bezichtiging/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/brochure/download/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/doorsturen/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/meld-een-fout/"},

Besides the desired subpages such as http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/ and http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/, there are many other subpages (the print/, reageer/, brochure/ variants and so on) that I did not intend to select. How can I select only the desired subpages?

1 Answer:

Answer 0 (score: 0)

For now I have added an if-statement that checks whether the URL has the desired number of forward slashes (6) and ends with a forward slash:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem
from scrapy.shell import inspect_response

class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % start_urls[0])
    rules = (
        Rule(le1, callback='parse_item'),
    )

    @staticmethod
    def house_link(link):
        # A house listing URL has exactly 6 forward slashes and ends in one,
        # e.g. http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/
        return link.url.count('/') == 6 and link.url.endswith('/')

    def parse_item(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            if self.house_link(link):
                item = FundaItem()
                item['url'] = link.url
                print("The item is " + str(item))
                yield item

Now scrapy crawl Funda -o funda.json generates a JSON file with the desired, limited set of URLs. (Note that -o appends to an existing output file, so running the crawl again without deleting funda.json first produces the two back-to-back arrays, ][, seen below; delete the file between runs to get a single valid array.)

[
{"url": "http://www.funda.nl/koop/amsterdam/huis-49879212-henri-berssenbruggehof-15/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49713458-jan-vrijmanstraat-29/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49818887-markiespad-19/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801593-jf-berghoefplantsoen-2/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49890140-talbotstraat-9/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49805292-nieuwendammerdijk-21/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801910-claus-van-amsbergstraat-86/"}
][
{"url": "http://www.funda.nl/koop/amsterdam/huis-49713458-jan-vrijmanstraat-29/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49701161-johannes-vermeerstraat-16/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49879212-henri-berssenbruggehof-15/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801593-jf-berghoefplantsoen-2/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49805292-nieuwendammerdijk-21/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49890140-talbotstraat-9/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801910-claus-van-amsbergstraat-86/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49818887-markiespad-19/"}
]

I would welcome more elegant solutions! It seems to me that determining the link depth of a URL is such a common task that methods/modules for it must already exist.
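
(Two tentative directions, as untested sketches rather than definitive fixes: the standard library's urlparse can compute the path depth of a URL without counting slashes by hand, and Scrapy's Rule accepts a process_links hook, so the filtering can live in the rule itself instead of the callback.)

from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

def path_depth(url):
    # Count the non-empty path segments; the listing pages have depth 3, e.g.
    # '/koop/amsterdam/huis-49800159-breezandpad-8/' -> ['koop', 'amsterdam', 'huis-...'].
    return len([segment for segment in urlparse(url).path.split('/') if segment])

Inside the spider, the same filter could instead be attached to the rule (process_links may be a string naming a spider method), so that only listing pages are requested and reach the callback at all:

    rules = (
        Rule(le1, callback='parse_item', process_links='filter_house_links'),
    )

    def filter_house_links(self, links):
        # Drop everything that is not a house listing page before Scrapy requests it.
        return [link for link in links if link.url.count('/') == 6 and link.url.endswith('/')]

    def parse_item(self, response):
        # Only listing pages reach this callback now, so the response URL
        # is itself the item URL.
        item = FundaItem()
        item['url'] = response.url
        yield item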