Scraping the next page with XPath

Time: 2018-08-29 04:31:13

Tags: python-3.x xpath scrapy scrapy-spider

I created a spider to scrape data from a website. It worked fine until I turned it into a CrawlSpider with a rule to make it continue to the next page. I guess my XPath in the Rule is wrong. Can you help me fix it? P.S. I'm using Python 3.

Here is my spider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from task11.items import Digi

class tutorial(CrawlSpider):
    name = "task11"
    allowed_domains = ["meetings.intherooms.com"]
    start_urls = ["https://meetings.intherooms.com/meetings/aa/al"]

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='(//a[@class="prevNext" and contains(text(),"Next")])[1]'),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_page(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@class="all-meetings"]/tr')
        items = []

        for site in sites[1:]:
            item = Digi()
            item['meeting_title'] = site.xpath('td/text()').extract()
            items.append(item)
        return items

This is the expected result I get after scraping the first page (and I hope to get more from the next pages):

2018-08-30 08:59:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://meetings.intherooms.com/meetings/aa/al>
{'meeting_title': ['Alabama Avenue & Lauderdale Street',
                   'SELMA,  ',
                   'TUESDAY',
                   '7:00 PM',
                   'Alcoholics Anonymous']}
2018-08-30 08:59:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://meetings.intherooms.com/meetings/aa/al>
{'meeting_title': ['Alabama Avenue & Lauderdale Street',
                   'SELMA,  ',
                   'THURSDAY',
                   '7:00 PM',
                   'Alcoholics Anonymous']}
2018-08-30 08:59:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://meetings.intherooms.com/meetings/aa/al>
{'meeting_title': ['Alabama Avenue & Lauderdale Street',
                   'SELMA,  ',
                   'SUNDAY',
                   '7:00 PM',
                   'Alcoholics Anonymous']}
2018-08-30 08:59:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://meetings.intherooms.com/meetings/aa/al>
{'meeting_title': ['210 Lauderdale Street',
                   'SELMA,  36703',
                   'MONDAY',
                   '6:00 PM',
                   'Alcoholics Anonymous']}

2 answers:

Answer 0 (score: 0)

I would use the class of the "Next" button:

response.xpath('//a[@class="prevNext"]/@href')

This gives 2 results: one for the link at the top of the page and one for the arrow at the bottom. However, once you open the next page (the second page), the previous page also gets a link with the prevNext class. This is not a big problem, since Scrapy filters out most of the duplicate requests anyway. But you can restrict the links with a text filter:

response.xpath('//a[contains(text(),"Next")]/@href')

Or, if you are not sure whether "Next" also appears in other links, you can combine both conditions:

response.xpath('//a[@class="prevNext" and contains(text(),"Next")]/@href')

Answer 1 (score: 0)

You need to use this for restrict_xpaths (not the link's text or href, but the link node itself):

restrict_xpaths='(//a[@class="prevNext" and contains(text(),"Next")])[1]'