Question

I am trying to navigate from this page to each county, and then to each city within each county: http://www.accountant-finder.com/CA/California-accountants.html

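My code opens the main page listed above and scrapes the title as the parse function specifies, but it does not seem to apply the rule to the county links, which are relative paths starting with "/CA/" (for example, CA/Alameda/Alameda_county-California-accountants.html).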

I have tried modifying the rule with various regular expressions, to no avail. What am I missing?

import scrapy
from scrapy.spiders import CrawlSpider,Rule
from acctfinder.items import Accountant
from scrapy.linkextractors import LinkExtractor


class AccountantSpider(CrawlSpider):
    name = "Accountant"
    allowed_domains = ["accountant-finder.com"]
    start_urls = ["http://www.accountant-finder.com/CA/California-accountants.html"]
    rules =(Rule(LinkExtractor(allow=('\/CA\/.*',)),callback="parse_item",follow=True),)

    def parse(self,response):
        item = Accountant()
        title = response.xpath('//h1/text()')[0].extract()
        print("title is: "+title)
        item['title'] = title
        return item

Answer 1

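This is a common mistake when using CrawlSpider. Check the documentation carefully: it specifies that you shouldn't be using the parse method as a callback, because CrawlSpider uses parse itself to implement its rule-following logic. Overriding parse therefore disables the rules.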
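To see why overriding parse stops the rules from running, here is a toy model in plain Python. It is not Scrapy's actual code: ToyCrawlSpider, BrokenSpider, WorkingSpider, and the substring-based rule format are all made up for illustration. It only shows the general effect of a subclass shadowing the base-class method that does the rule dispatch.

```python
class ToyCrawlSpider:
    """Hypothetical stand-in for a rule-driven spider (not Scrapy's implementation)."""

    # Each toy rule is (substring_to_look_for, name_of_callback_method).
    rules = ()

    def parse(self, url):
        # In this toy model, the base-class parse() is where rules are applied.
        results = []
        for needle, callback_name in self.rules:
            if needle in url:
                results.append(getattr(self, callback_name)(url))
        return results


class BrokenSpider(ToyCrawlSpider):
    rules = (("/CA/", "parse_item"),)

    def parse(self, url):
        # This override replaces the rule dispatch, so the rules never run.
        return ["parsed start page only"]


class WorkingSpider(ToyCrawlSpider):
    rules = (("/CA/", "parse_item"),)

    def parse_item(self, url):
        # The base-class parse() is left alone and calls this via the rule.
        return "item from " + url


print(BrokenSpider().parse("http://example.com/CA/a.html"))
print(WorkingSpider().parse("http://example.com/CA/a.html"))
```

The broken variant never consults its rules, which matches the symptom in the question: the start page is parsed, but no county links are followed.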

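Another thing about your spider: the rule specifies that each item should be handled by the parse_item method. So just rename the parse method to parse_item and it should start working.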

Scrapy CrawlSpider not crawling: is it the RegEx?
