When I scrape data from the detail pages of this site, I get the error scrapy.exceptions.NotSupported. With only a few pages I can still get data, but when I increase the number of pages, Scrapy keeps running without producing any more output and will not stop. Thanks in advance!
The pages contain images, but I do not want to scrape them; the problem may be that some response content is not text.
Here is the error:
2017-02-18 15:35:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.google.com.my:443/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en> from <GET http://maps.google.com.my/maps?f=q&source=s_q&hl=en&q=bs+bio+science+sdn+bhd&vps=1&jsv=171b&sll=4.109495,109.101269&sspn=25.686885,46.318359&ie=UTF8&ei=jPeISu6RGI7kugOboeXiDg&cd=1&usq=bs+bio+science+sdn+bhd&geocode=FQdNLwAdEm4QBg&cid=12762834734582014964&li=lmd>
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://com> (failed 3 times): DNS lookup failed: address 'com' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.byunature> (failed 3 times): DNS lookup failed: address 'www.byunature' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.borneococonutoil.com> (failed 3 times): DNS lookup failed: address 'www.borneococonutoil.com' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://com>: DNS lookup failed: address 'com' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.byunature>: DNS lookup failed: address 'www.byunature' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.borneococonutoil.com>: DNS lookup failed: address 'www.borneococonutoil.com' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.google.com.my/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en&dg=dbrw&newdg=1> from <GET https://www.google.com.my:443/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en>
2017-02-18 15:35:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com.my/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en&dg=dbrw&newdg=1> (referer: http://www.bsbioscience.com/contactus.html)
2017-02-18 15:35:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html)
2017-02-18 15:35:41 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Scrapy\tutorial\tutorial\spiders\tu2.py", line 17, in parse
    company = response.css('font:nth-child(3)::text').extract_first()
  File "c:\python27\lib\site-packages\scrapy\http\response\__init__.py", line 97, in css
    raise NotSupported("Response content isn't text")
NotSupported: Response content isn't text
2017-02-18 15:35:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-18 15:35:41 [scrapy.extensions.feedexport] INFO: Stored json feed (30 items) in: tu2.json
2017-02-18 15:35:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 55,
'downloader/exception_type_count/scrapy.exceptions.NotSupported': 31,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 24,
My code:
import scrapy
import json
from scrapy.linkextractors import LinkExtractor
# import LxmlLinkExtractor as LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = "tu2"

    def start_requests(self):
        baseurl = 'http://edirectory.matrade.gov.my/application/edirectory.nsf/category?OpenForm&query=product&code=PT&sid=BED1E22D5BE3F9B5394D6AF0E742828F'
        urls = []
        for i in range(1, 3):
            urls.append(baseurl + "&page=" + str(i))
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        company = response.css('font:nth-child(3)::text').extract_first()
        key3 = "Business Address"
        key4 = response.css('tr:nth-child(4) td:nth-child(1) b::text').extract_first()
        key5 = response.css('tr:nth-child(5) td:nth-child(1) b::text').extract_first()
        value3 = response.css('tr:nth-child(3) .table-middle:nth-child(3)::text').extract_first()
        value4 = response.css('tr:nth-child(4) td:nth-child(3)::text').extract_first()
        value5 = response.css('tr:nth-child(5) td:nth-child(3)::text').extract_first()
        # bla = {}
        # if key3 is not None:
        #     bla[key3] = value3
        if value3 is not None:
            json_data = {
                'company': company,
                key3: value3,
                key4: value4,
                key5: value5,
            }
            yield json_data
            # yield json.dumps(bla)
        # follow links to the detail pages
        for button in response.css('td td a'):
            detail_page_url = button.css('::attr(href)').extract_first()
            if detail_page_url is not None:
                page_url = response.urljoin(detail_page_url)
                yield scrapy.Request(page_url, callback=self.parse)
Answer 0 (score: 1)
Your spider is crawling PDF files here:

[scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html)

You need to filter these out manually, or use LinkExtractor, which already does this for you.
By default, LinkExtractor ignores many non-HTML file types, including pdf; see the source here for the full list.
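If you want to check exactly what gets filtered, the default ignore list is importable (a quick sketch; assumes Scrapy 1.x, which your logs suggest):

    from scrapy.linkextractors import IGNORED_EXTENSIONS

    # 'pdf' is in the default ignore list, so LinkExtractor drops PDF links
    print('pdf' in IGNORED_EXTENSIONS)  # True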
For your code sample, try something like this:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    url = 'someurl'
    if '.pdf' not in url:
        yield Request(url, self.parse2)

    # or let LinkExtractor filter the links for you
    le = LinkExtractor()
    for link in le.extract_links(response):
        yield Request(link.url, self.parse2)
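If you prefer to keep your own link-following loop instead, you can also guard parse itself by skipping any response whose body is not text before calling .css(), since that call is what raises NotSupported. A minimal sketch; the TextResponse check is my suggestion, not part of the original answer:

    import scrapy
    from scrapy.http import TextResponse

    def parse(self, response):
        # .css()/.xpath() only work on text responses; a binary body
        # (a PDF or an image) raises NotSupported, so bail out early
        if not isinstance(response, TextResponse):
            self.logger.debug("Skipping non-text response: %s", response.url)
            return
        for href in response.css('td td a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)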