Question

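I am using Scrapy 1.4 to crawl web pages from a given set of URLs. I need help extracting several pieces of information from each page: the URL, the title, and the body text.

At the moment I am using the following URL: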

https://healthlibrary.epnet.com/GetContent.aspx?token=3bb6e77f-7239-4082-81fb-4aeb0064ca19&chunkiid=32905

My code is:

class gsapocSpider(BaseSpider):
    name = "gsapoc"
    start_urls = ["https://healthlibrary.epnet.com/GetContent.aspx?token=3bb6e77f-7239-4082-81fb-4aeb0064ca19&chunkiid=32905"]

    def parse(self, response):
        responseSelector = Selector(response) 
        for sel in responseSelector.xpath('//ul/li'):
            item = GsapocItem()
            item['title'] = sel.xpath('//ul/li/a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body'] = sel.xpath('//body//p//text()').extract()
            #item['text'] = sel.xpath('//text()').extract()
            #body = response.xpath('//body//p//text()').extract()
            #print(body)
            yield item

Answer 1

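I don't understand why you set up your XPath expressions this way. Your page doesn't even contain any ul elements.

Since your goal is just to get the URL, the title, and the body, here are some suggestions: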

URL. You can read it directly from the response:

response.url

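Title. Depending on which kind of title you are after, there are two options: the title tag, or a specific element on the page.
Body. Do you want the whole page or only the text? For the former, response.body is fine; for the latter, you need to specify how to extract all of the content.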

Either way, I think you will need some working knowledge of HTML and XPath.

Thanks.

How to extract all the content of a web page with Scrapy
