Incomplete URLs in Scrapy

Posted: 2014-01-28 05:41:20

Tags: python scrapy

The output I am getting looks like this:

[crawler] DEBUG: Crawled (200) <GET http://www.hormelfoods.com/About/Legal/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/link.aspx?_id=EFFFBF3348524C6ABCD1C2775E7FDA93&_z=z> (referer: http://www.hormelfoods.com/About/Legal/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/link.aspx?_id=3FC7ECFD861B4F1AAF7BFD218236F983&_z=z)

I looked at the page source of the referer; it shows this link: <a href="~/link.aspx?_id=EFFFBF3348524C6ABCD1C2775E7FDA93&amp;_z=z">
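As far as I can tell, href="~/link.aspx?..." is an ASP.NET-style application-root-relative URL: the "~/" prefix only means something to the server, so to a crawler it is just an ordinary path segment. Each time such a link is joined against a page whose URL already ends in ~/link.aspx, one more "~/" gets appended. A minimal demonstration (the _id values here are shortened placeholders):

from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = "http://www.hormelfoods.com/About/Legal/~/link.aspx?_id=A&_z=z"
print(urljoin(base, "~/link.aspx?_id=B&_z=z"))
# http://www.hormelfoods.com/About/Legal/~/~/link.aspx?_id=B&_z=z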

How can I correct this?

I have added a counter that tracks how many PDFs have been parsed.

My parse_item function:

from urlparse import urljoin  # urllib.parse.urljoin on Python 3

import tldextract
from scrapy.exceptions import CloseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url

from crawler.items import CrawlerItem  # project items module


def parse_item(self, response):
    sel = HtmlXPathSelector(response)

    for utype in self.url_types:
        links = []

        # <a> elements whose href mentions this file type (e.g. "pdf")
        links += sel.xpath('//a[contains(@href, "{0}")]/@href'.format(utype)).extract()
        # <link> elements in <head> advertising a matching MIME type
        links += sel.xpath('/html/head/link[@type="application/{0}"]/@href'.format(utype)).extract()

        self.cntr += len(links)
        if self.cntr > 60:
            raise CloseSpider('links exceeded')

        base_url = get_base_url(response)
        for link in links:
            item = CrawlerItem()
            item['main'] = response.url
            item['url'] = urljoin(base_url, link)
            company = tldextract.extract(base_url)
            item['source'] = company.domain
            item['type'] = utype.upper()
            yield item


def process_links(self, links):
    # Runs on the links the LinkExtractor found, before requests are made.
    for i, w in enumerate(links):
        w.url = w.url.replace("../", "")
        links[i] = w
    return links
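Since process_links receives the already-absolutized URLs from the LinkExtractor before any requests are scheduled, would it be right to collapse the accumulated "~/" run there, instead of replacing "../"? A minimal, untested sketch, assuming the server actually serves link.aspx from the site root:

from urllib.parse import urlsplit, urlunsplit  # urlparse on Python 2

def process_links(self, links):
    for link in links:
        parts = urlsplit(link.url)
        if "~/" in parts.path:
            # Keep only what follows the last "~/" and re-root it at "/"
            # (assumption: link.aspx resolves from the application root).
            tail = parts.path.rsplit("~/", 1)[1]
            link.url = urlunsplit(
                (parts.scheme, parts.netloc, "/" + tail, parts.query, parts.fragment))
    return links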

0 Answers:

No answers yet.