I am trying to scrape information from the pages listed on this page: https://pardo.ch/pardo/program/archive/2017/catalog-films.html
The XPath selector:
film_page_urls_startpage = sel.xpath('//article[@class="strip-list_link_all strip-list strip--color row row--5"]/a/@href').extract()
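correctly grabs all 23 URLs. However, the spider does not seem to even attempt to crawl all 23. It crawls only 11 each time, and always the same 11. Because I am using Selenium, I can watch it simply skip pages/URLs without ever navigating to them. What gives?
Here is my code: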
from scrapy import Spider
from scrapy.http import Request
from selenium import webdriver
from scrapy.selector import Selector
from time import sleep
from selenium.common.exceptions import NoSuchElementException
from scrapy.loader import ItemLoader
from films_locarno.items import FilmsLocarnoItem


class FilmsLocarnoSpiderSpider(Spider):
    name = 'films_locarno_spider'
    allowed_domains = ['https://pardo.ch/']
    start_urls = ['https://pardo.ch/pardo/program/archive/2017/catalog-films.html']

    def start_requests(self):
        self.driver = webdriver.Firefox()
        self.driver.get('https://pardo.ch/pardo/program/archive/2017/catalog-films.html')
        sel = Selector(text=self.driver.page_source)

        # grab list of start pages for all 4/5 editions of festival available
        # list of film page urls on start page (letter A)
        film_page_urls_startpage = sel.xpath('//article[@class="strip-list_link_all strip-list strip--color row row--5"]/a/@href').extract()

        film_page_urls_startpage_full = []
        for url in film_page_urls_startpage:
            film_page_fullurl = "https://pardo.ch" + url
            film_page_urls_startpage_full.append(film_page_fullurl)

        # navigate to startpage film_pages
        for url3 in film_page_urls_startpage_full:
            self.driver.get(url3)
            sel = Selector(text=self.driver.page_source)

            self.logger.info('Sleeping for 1 second')
            sleep(1)

            yield Request(url3, callback=self.parse_filmpage)

            self.logger.info('Sleeping for 2 seconds')
            sleep(2)
My output log shows the following [you can ignore the error; it is just a page-navigation error that has since been fixed]:
2017-12-26 09:29:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: films_locarno)
2017-12-26 09:29:33 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['films_locarno.spiders'], 'BOT_NAME': 'films_locarno', 'NEWSPIDER_MODULE': 'films_locarno.spiders', 'FEED_URI': 'films_locarno6.csv', 'FEED_FORMAT': 'csv'}
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2017-12-26 09:29:33 [scrapy.core.engine] INFO: Spider opened
2017-12-26 09:29:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-26 09:29:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-12-26 09:29:34 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session {"capabilities": {"firstMatch": [], "alwaysMatch": {"browserName": "firefox", "acceptInsecureCerts": true}}, "desiredCapabilities": {"browserName": "firefox", "acceptInsecureCerts": true}}
2017-12-26 09:29:41 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:41 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/catalog-films.html"}
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70"}
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:56 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:29:57 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:29:59 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70"}
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:03 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:04 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:06 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=968681&eid=70"}
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:09 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:10 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:12 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=959475&eid=70"}
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:14 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:15 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:17 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960897&eid=70"}
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:19 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:20 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:22 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960706&eid=70"}
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:25 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:26 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:28 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=929220&eid=70"}
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:32 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:33 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:35 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960742&eid=70"}
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:38 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-26 09:30:39 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:41 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960703&eid=70"}
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:44 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:45 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:47 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=963699&eid=70"}
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:50 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70> (referer: None)
2017-12-26 09:30:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70> (referer: None)
2017-12-26 09:30:51 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=964462&eid=70"}
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:58 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=968681&eid=70> (referer: None)
2017-12-26 09:30:59 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:02 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:05 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:31:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch<a href=\"?finit=B\" class=\"dd__list__link\">B</a>"}
2017-12-26 09:31:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:31:07 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/Users/MNK1/Desktop/films_locarno/films_locarno/spiders/films_locarno_spider.py", line 48, in start_requests
self.driver.get(films_list_page)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 268, in get
self.execute(Command.GET, {'url': url})
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Malformed URL: https://pardo.ch<a href="?finit=B" class="dd__list__link">B</a> is not a valid URL.
2017-12-26 09:31:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=959475&eid=70> (referer: None)
2017-12-26 09:31:07 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:10 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296> referred in <None>
2017-12-26 09:31:10 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296> referred in <None>
2017-12-26 09:31:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=960897&eid=70> (referer: None)
2017-12-26 09:31:10 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:13 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F430%2FOC973705_P3001_240430.jpg&w=539&h=296> referred in <None>
2017-12-26 09:31:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=960706&eid=70> (referer: None)
2017-12-26 09:31:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70>
{'color': ['Color'],
'country': ['Pakistan, USA'],
'director': [''],
'festival_edition': ['70th'],
'festival_year': ['2017'],
'film_year': ['2015'],
'format_': ['DCP'],
'image_urls': ['https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296'],
'images': [{'checksum': '89dd9751e436eed7ae35f980c2e10bc3',
'path': 'full/53cb39b642dcd6cea1e7898c9dc4777b844ea4fd.jpg',
'url': 'https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296'}],
'language': ['Urdu'],
'length': ["40'"],
'program': ['Open Doors: Screenings'],
'title': ['A Girl in the River: The Price of Forgiveness']}
2017-12-26 09:31:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70>
{'color': ['Color'],
'country': ['Switzerland'],
'director': [''],
'festival_edition': ['70th'],
'festival_year': ['2017'],
'film_year': ['2017'],
'format_': ['DCP'],
'image_urls': ['https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296'],
'images': [{'checksum': 'cce5e9ffd3bad2b359c489ac4c51c25e',
'path': 'full/84e0d100fc90acf2c0cfe8c38454a305e23b7408.jpg',
'url': 'https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296'}],
[[edited for length]]
2017-12-26 09:31:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3038,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 115519,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'file_count': 11,
'file_status_count/uptodate': 11,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 12, 26, 17, 31, 35, 820684),
'item_scraped_count': 11,
'log_count/DEBUG': 86,
'log_count/ERROR': 1,
'log_count/INFO': 43,
'memusage/max': 79556608,
'memusage/startup': 66007040,
'response_received_count': 11,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'start_time': datetime.datetime(2017, 12, 26, 17, 29, 33, 860768)}
2017-12-26 09:31:35 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 0 (score: 0)
I checked len(film_page_urls_startpage) and got only 11, not 23.
If I use xpath('//article/a/@href') instead, I get all 23 URLs.
There is no need to test @class, because the page has no other article elements.
EDIT:
If I run
for item in sel.xpath('//article/@class').extract():
    print('class:', item)
then I get
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
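So some items have an extra "even" at the end of their class string, and that is your problem: an exact comparison against @class matches only the 11 articles without it.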
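The effect can be reproduced without the site or Selenium. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree instead of a Scrapy Selector, and the three-row markup and fid values are made up rather than copied from pardo.ch. It compares the exact @class test from the question with a selector that has no class test and with a test on a single class token. In Scrapy itself, the token test would presumably be written as sel.xpath('//article[contains(@class, "strip-list")]/a/@href') or sel.css('article.strip-list > a::attr(href)').

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the listing page: every other <article> carries
# an extra "even" token in its class attribute, as in the output above.
BASE = "strip-list_link_all strip-list strip--color row row--5"
SAMPLE = """
<div>
  <article class="{base} even"><a href="/film.html?fid=1"/></article>
  <article class="{base}"><a href="/film.html?fid=2"/></article>
  <article class="{base} even"><a href="/film.html?fid=3"/></article>
</div>
""".format(base=BASE)

root = ET.fromstring(SAMPLE)

# Exact string comparison on @class: rows with the extra token are skipped.
exact = [a.get("href") for a in root.findall(".//article[@class='%s']/a" % BASE)]

# No class test at all, as suggested in the answer: every row is matched.
all_rows = [a.get("href") for a in root.findall(".//article/a")]

# Token-based test: keep articles whose class list contains "strip-list".
by_token = [
    article.find("a").get("href")
    for article in root.iter("article")
    if "strip-list" in article.get("class", "").split()
]

print(len(exact), len(all_rows), len(by_token))
```

Dropping the class test is the simplest fix here because the page has no other article elements. A token-based test is the safer choice if that ever changes.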