Question

I am trying to scrape this page.

I want to use Scrapy to get all the links from a given website.

This is what I have tried:

import scrapy


class ElementSpider(scrapy.Spider):
    name = 'linkdata'

    start_urls = ["https://www.goodreads.com/list/show/19793.I_Marked_My_Calendar_For_This_Book_s_Release"]

    def parse(self, response):
        # Try to collect the href of every book link in the list table.
        links = response.xpath('//div[@id="all_votes"]/table[@class="tableList js-dataTooltip"]/div[@class="js-tooltipTrigger tooltipTrigger"]/a/@href').extract()
        print(links)

But my output is empty.

Answer 1

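I think your XPath is wrong: it looks for the div as a direct child of the table and skips the tr and td elements in between. Try this: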

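# Inside parse(): step through tr/td[2] to reach the div that holds the link.
for href in response.xpath('//div[@id="all_votes"]/table[@class="tableList js-dataTooltip"]/tr/td[2]/div[@class="js-tooltipTrigger tooltipTrigger"]/a/@href'):
    # Resolve the relative href against the URL of the page.
    full_url = response.urljoin(href.extract())
    print(full_url)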
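
To see why the intermediate steps matter, here is a minimal offline sketch. It is not Scrapy itself: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the href is read with .get("href") instead of /@href, and urllib.parse.urljoin stands in for response.urljoin. The SAMPLE markup is a simplified, hypothetical fragment, not the real Goodreads HTML, and the hrefs() helper is made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified stand-in for the page markup (an assumption).
SAMPLE = """
<html><body>
<div id="all_votes">
  <table class="tableList js-dataTooltip">
    <tr>
      <td>1</td>
      <td><div class="js-tooltipTrigger tooltipTrigger"><a href="/book/show/1.First">cover</a></div></td>
    </tr>
    <tr>
      <td>2</td>
      <td><div class="js-tooltipTrigger tooltipTrigger"><a href="/book/show/2.Second">cover</a></div></td>
    </tr>
  </table>
</div>
</body></html>
"""

BASE_URL = "https://www.goodreads.com/list/show/19793.I_Marked_My_Calendar_For_This_Book_s_Release"
TABLE = ".//div[@id='all_votes']/table[@class='tableList js-dataTooltip']"
CELL = "/div[@class='js-tooltipTrigger tooltipTrigger']/a"

root = ET.fromstring(SAMPLE)


def hrefs(path):
    """Return absolute URLs for every <a> element matched by path."""
    return [urljoin(BASE_URL, a.get("href")) for a in root.findall(path)]


# "/" selects direct children only, and the table has no direct div child.
broken = hrefs(TABLE + CELL)
# Walking through tr/td[2] reaches the div that wraps each link.
fixed = hrefs(TABLE + "/tr/td[2]" + CELL)

print(broken)
print(fixed)
```

On this sample the first path matches nothing and the second returns both links as absolute URLs. The real page may be structured differently, so check the actual response (for example in scrapy shell) before relying on the exact path.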

Hope that helps :)

Good luck...
