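I am trying to scrape the website http://www.funda.nl/koop/amsterdam/, which lists houses for sale in Amsterdam. The main page contains many links, some of which point to individual houses for sale. I ultimately want to follow these links and extract data from them.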
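As a first step, I am simply trying to list the links that correspond to individual houses. I noticed that their URLs contain "huis-" followed by an 8-digit code, for example http://www.funda.nl/koop/amsterdam/huis-49801910-claus-van-amsbergstraat-86/. I would like to use the regular expression r'huis-\d{8}' to match this part of the URL.
I am trying to do this with Scrapy's LinkExtractor, but it does not seem to work. The spider I wrote is as follows: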
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from Funda.items import FundaItem
    from scrapy.shell import inspect_response

    class FundaSpider(CrawlSpider):
        name = "Funda"
        allowed_domains = ["funda.nl"]
        start_urls = ["http://www.funda.nl/koop/amsterdam/"]

        le1 = LinkExtractor()

        rules = (
            Rule(LinkExtractor(allow=r'huis-\d{8}'), callback='parse_item'),
        )

        def parse_item(self, response):
            links = self.le1.extract_links(response)
            for link in links:
                item = FundaItem()
                item['url'] = link.url
                print("The item is "+str(item))
                yield item
In the main project directory, if I run scrapy crawl Funda -o funda.json, the resulting funda.json file starts with the following lines:
[
{"url": "http://www.funda.nl/cookiebeleid/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/ufsavqdqfvxyerrvff.html"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/"},
{"url": "http://www.funda.nl/koop/"},
{"url": "https://www.funda.nl/mijn/login/?ReturnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F"},
{"url": "https://www.funda.nl/mijn/aanmelden/?ReturnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F"},
{"url": "http://www.funda.nl/language/switchlanguage/?language=en&returnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F"},
{"url": "https://help.funda.nl/hc/nl/categories/200207038"},
{"url": "http://www.funda.nl/koop/amsterdam/"},
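As you can see, it contains several lines with links that have neither "huis-" nor the 8-digit code. How can I filter this down to only the "genuine" house links?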
Answer 0 (score: 1)
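The problem is that the regular expression is given only to the LinkExtractor inside rules, not to le1. The rule decides which pages are passed to parse_item, but parse_item then uses le1, which has no filter, so it extracts every link on each of those pages. Adding the pattern to the definition of le1, i.e. le1 = LinkExtractor(allow=r'huis-\d{8}'), makes the output come out as expected.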
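To see what the pattern actually lets through, here is a small sketch that uses only Python's re module on some of the URLs from the funda.json output in the question. It assumes that LinkExtractor applies each allow pattern as a regex search against the full URL, which is how I understand it to work. The stricter STRICT pattern is my own suggestion and is not part of the original answer.

```python
import re

# A few URLs copied from the funda.json output shown in the question.
urls = [
    "http://www.funda.nl/cookiebeleid/",
    "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/ufsavqdqfvxyerrvff.html",
    "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/",
    "http://www.funda.nl/koop/",
    "https://www.funda.nl/mijn/login/?ReturnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F",
    "http://www.funda.nl/koop/amsterdam/",
]

# The pattern from the question: matches "huis-" plus 8 digits anywhere in the URL.
LOOSE = re.compile(r"huis-\d{8}")

# Hypothetical stricter pattern (an assumption, not from the answer):
# the URL must end right after the house slug and its trailing slash.
STRICT = re.compile(r"/koop/amsterdam/huis-\d{8}-[^/]+/$")

loose_hits = [u for u in urls if LOOSE.search(u)]
strict_hits = [u for u in urls if STRICT.search(u)]

print("loose pattern keeps:")
for u in loose_hits:
    print("  ", u)

print("strict pattern keeps:")
for u in strict_hits:
    print("  ", u)
```

With these sample URLs the loose pattern still keeps the login link, because the house slug appears inside its ReturnUrl query string, and it also keeps the ".html" sub-page. If only the canonical house pages are wanted, a tighter pattern along the lines of STRICT could be passed as allow to le1 instead.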