I am trying to scrape the site http://www.funda.nl/koop/amsterdam/, which lists houses for sale in Amsterdam, and to extract data from the subpages for individual houses, such as http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/. As a first step, I want to obtain a list of all these subpages. So far I have the following spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem
from scrapy.shell import inspect_response


class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % start_urls[0], allow_domains='funda.nl')

    rules = (
        Rule(le1, callback='parse_item'),
    )

    def parse_item(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            item = FundaItem()
            item['url'] = link.url
            print("The item is " + str(item))
            yield item
If I run this with JSON output as scrapy crawl Funda -o funda.json, the resulting funda.json looks like this (first few lines only):
[
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/ywavcsbywacbcasxcxq.html"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/print/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/reageer/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/bezichtiging/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/brochure/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/doorsturen/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/meld-een-fout/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/ywavcsbywacbcasxcxq.html"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/print/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/reageer/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/bezichtiging/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/brochure/download/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/doorsturen/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/meld-een-fout/"},
Besides the desired subpages http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/ and http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/, there are many subpages (such as /print/ and /brochure/) that I did not intend to select. How can I select only the house subpages?
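The over-matching comes from the unanchored allow pattern: LinkExtractor applies its allow patterns with re.search, so any URL that merely contains the listing prefix followed by huis- plus eight digits is accepted, including the /print/, /reageer/ and similar action subpages. A minimal standalone demonstration (plain re, no Scrapy needed):

```python
import re

start = "http://www.funda.nl/koop/amsterdam/"
pattern = r'%s+huis-\d{8}' % start  # unanchored: nothing constrains what follows the digits

urls = [
    "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/",
    "http://www.funda.nl/koop/amsterdam/huis-49801360-brede-vogelstraat-2/print/",
]
# re.search finds the pattern somewhere in each URL, so both pass the filter
print([u for u in urls if re.search(pattern, u)])
```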
Answer 0 (score: 0)
I have now added an if statement that checks whether the url has the desired number of forward slashes (6) and ends with a forward slash:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem
from scrapy.shell import inspect_response


class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % start_urls[0])

    rules = (
        Rule(le1, callback='parse_item'),
    )

    @staticmethod
    def house_link(link):
        # A canonical house page has exactly 6 forward slashes and ends in one
        url = link.url
        return url.count('/') == 6 and url.endswith('/')

    def parse_item(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            if self.house_link(link):
                item = FundaItem()
                item['url'] = link.url
                print("The item is " + str(item))
                yield item
Now scrapy crawl Funda -o funda.json produces a JSON file with the desired, limited set of URLs (the two concatenated arrays below are likely from running the command twice, since -o appends to an existing output file):
[
{"url": "http://www.funda.nl/koop/amsterdam/huis-49879212-henri-berssenbruggehof-15/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49713458-jan-vrijmanstraat-29/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49818887-markiespad-19/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801593-jf-berghoefplantsoen-2/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49890140-talbotstraat-9/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49805292-nieuwendammerdijk-21/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801910-claus-van-amsbergstraat-86/"}
][
{"url": "http://www.funda.nl/koop/amsterdam/huis-49713458-jan-vrijmanstraat-29/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49701161-johannes-vermeerstraat-16/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49879212-henri-berssenbruggehof-15/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801593-jf-berghoefplantsoen-2/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49805292-nieuwendammerdijk-21/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49890140-talbotstraat-9/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801910-claus-van-amsbergstraat-86/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49818887-markiespad-19/"}
]
I would welcome a more elegant solution! It seems to me that determining the link depth of a URL is a common enough task that methods/modules for it should already exist.
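One self-contained alternative (a sketch, not tested against the live site): instead of filtering after extraction, anchor the pattern so it only matches canonical house URLs, and compute the path depth with the standard-library urllib.parse. The helper below is hypothetical; the regex and the depth value of 3 are inferred from the URLs shown above.

```python
import re
from urllib.parse import urlparse

# Canonical house pages look like
# http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/
HOUSE_RE = re.compile(r'/koop/amsterdam/huis-\d{8}[^/]*/$')

def is_house_page(url):
    """True only for canonical house pages: matching path, exactly 3 path segments."""
    path = urlparse(url).path
    depth = len([seg for seg in path.split('/') if seg])  # count non-empty segments
    return depth == 3 and bool(HOUSE_RE.search(path))

urls = [
    "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/",
    "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/print/",
    "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/ywavcsbywacbcasxcxq.html",
]
print([u for u in urls if is_house_page(u)])
```

The same anchored pattern could also be passed directly to the extractor, e.g. LinkExtractor(allow=r'koop/amsterdam/huis-\d{8}[^/]*/$'), since LinkExtractor's allow patterns are applied with re.search; that would avoid the post-extraction filtering in parse_item altogether.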