The output I'm getting looks like this:
[crawler] DEBUG: Crawled (200) <GET http://www.hormelfoods.com/About/Legal/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/link.aspx?_id=EFFFBF3348524C6ABCD1C2775E7FDA93&_z=z> (referer: http://www.hormelfoods.com/About/Legal/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/~/link.aspx?_id=3FC7ECFD861B4F1AAF7BFD218236F983&_z=z)
Looking at the page source of the referer, I can see this anchor:

<a href="~/link.aspx?_id=EFFFBF3348524C6ABCD1C2775E7FDA93&_z=z">

How can I fix this?
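What appears to be happening is that the href is relative, so each hop resolves it against the current page's already-prefixed path and appends one more ~/ segment. A quick illustration with the standard library (urlparse on Python 2; urllib.parse on Python 3):

>>> from urlparse import urljoin
>>> base = 'http://www.hormelfoods.com/About/Legal/~/link.aspx?_id=3FC7ECFD861B4F1AAF7BFD218236F983&_z=z'
>>> urljoin(base, '~/link.aspx?_id=EFFFBF3348524C6ABCD1C2775E7FDA93&_z=z')
'http://www.hormelfoods.com/About/Legal/~/~/link.aspx?_id=EFFFBF3348524C6ABCD1C2775E7FDA93&_z=z'

Since ~ is not a dot-segment, nothing removes it during resolution, and every hop grows the path by one ~/ — exactly the growth visible in the log above.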
As a stopgap, I added a counter that tracks how many PDF links have been parsed. My parse_item method:
# Imports this snippet relies on (CrawlerItem's module path is project-specific):
from urlparse import urljoin  # Python 2; use urllib.parse.urljoin on Python 3
import tldextract
from scrapy.exceptions import CloseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from myproject.items import CrawlerItem

def parse_item(self, response):
    sel = HtmlXPathSelector(response)
    # self.url_types holds the file types to look for, e.g. ['pdf', 'x-pdf']
    for utype in self.url_types:
        links = []
        # Collect matching URLs from <a href="..."> anchors in the body
        # and from <link type="application/..."> tags in the head.
        links += sel.xpath('//a[contains(@href, "{0}")]/@href'.format(utype)).extract()
        links += sel.xpath('/html/head/link[@type="application/{0}"]/@href'.format(utype)).extract()
        self.cntr += len(links)
        if self.cntr > 60:
            raise CloseSpider('links exceeded')
        for link in links:
            item = CrawlerItem()
            item['main'] = response.url
            base_url = get_base_url(response)
            item['url'] = urljoin(base_url, link)
            company = tldextract.extract(base_url)
            item['source'] = company.domain
            item['type'] = utype.upper()
            yield item
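def __init__(self, *args, **kwargs):
    # Assumed sketch: self.cntr must start at 0 before parse_item runs.
    # MySpider stands in for the actual spider class name.
    super(MySpider, self).__init__(*args, **kwargs)
    self.cntr = 0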
def process_links(self, links):
    # Strip "../" from every extracted link before it is scheduled.
    for i, w in enumerate(links):
        w.url = w.url.replace("../", "")
        links[i] = w
    return links
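Note that process_links already strips "../" but never touches the "~/" segments that the log shows piling up. One possible fix, assuming the site resolves ~/link.aspx identically at any depth (the _id parameter is what selects the target page), is to collapse runs of ~/ so equivalent URLs canonicalize to one form and Scrapy's duplicate filter can drop the revisits:

import re

def process_links(self, links):
    # Sketch: collapse ".../~/~/~/link.aspx" down to ".../~/link.aspx"
    # so the path length stays bounded and repeat visits dedupe.
    for link in links:
        link.url = re.sub(r'(?:~/)+', '~/', link.url)
    return links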