我正在尝试获取该网站(https://minerals.usgs.gov/science/mineral-deposit-database/#products)的链接,并从每个网站中抓取标题。但是它不起作用!蜘蛛似乎未遵循链接!
代码
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from depositsusa.items import DepositsusaItem
from scrapy.loader import ItemLoader
class DepositsSpider(CrawlSpider):
name = 'deposits'
allowed_domains = ['web']
start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-
database/#products', ]
rules = (
Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
callback='parse'),
)
def parse(self, response):
i = ItemLoader(item=DepositsusaItem(), response=response)
i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
i.add_value('url', response.url)
i.add_value('project', self.settings.get('BOT_NAME'))
i.add_value('spider', self.name)
i.add_value('server', socket.gethostname())
i.add_value('date', datetime.datetime.now())
return i.load_item()
项目
import scrapy
from scrapy.item import Item, Field
class DepositsusaItem(Item):
# main fields
name = Field()
# Housekeeping Fields
url = Field()
project = Field()
spider = Field()
server = Field()
date = Field()
pass
输出
(base) C:\Users\User\Documents\Python WebCrawling Learing
Projects\depositsusa>scrapy crawl deposits
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot:
depositsusa)
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2
2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python
3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)],
pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform
Windows-10-10.0.17134-SP0
2018-11-17 00:29:48 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME':
'depositsusa', 'NEWSPIDER_MODULE': 'depositsusa.spiders', 'ROBOTSTXT_OBEY':
True, 'SPIDER_MODULES': ['depositsusa.spiders']}
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled downloader
middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-11-17 00:29:48 [scrapy.core.engine] INFO: Spider opened
2018-11-17 00:29:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0
pages/min), scraped 0 items (at 0 items/min)
2018-11-17 00:29:48 [scrapy.extensions.telnet] DEBUG: Telnet console
listening on 127.0.0.1:6024
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://minerals.usgs.gov/robots.txt> (referer: None)
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://minerals.usgs.gov/science/mineral-deposit-database/#products>
(referer: None)
2018-11-17 00:29:49 [scrapy.core.scraper] DEBUG: Scraped from <200
https://minerals.usgs.gov/science/mineral-deposit-database/>
{'date': [datetime.datetime(2018, 11, 17, 0, 29, 49, 832526)],
'project': ['depositsusa'],
'server': ['DESKTOP-9CUE746'],
'spider': ['deposits'],
'url': ['https://minerals.usgs.gov/science/mineral-deposit-database/']}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-17 00:29:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 475,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 25123,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 16, 23, 29, 49, 848053),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 11, 16, 23, 29, 48, 520273)}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Spider closed (finished)
我是python的新手,那么似乎是什么问题呢?与linkextraction或parse函数有关吗?
答案 0 :(得分:1)
您必须更改几件事。
首先,当您使用CrawlSpider
时,将没有名为parse
的回调,因为它将覆盖CrawlSpider
的{{1}}:{{3} }
https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
其次,您希望获得parse
的正确列表。
尝试这样的事情:
allowed_domains