I have been trying to follow the Scrapy tutorial, but after running the crawl command at the top level of the project I get the following output:
2016-07-05 21:06:01 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-07-05 21:06:01 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-07-05 21:06:01 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-07-05 21:06:02 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-05 21:06:02 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-05 21:06:02 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-05 21:06:02 [scrapy] INFO: Spider opened
2016-07-05 21:06:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-05 21:06:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-07-05 21:06:02 [scrapy] INFO: Closing spider (finished)
2016-07-05 21:06:02 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 5, 13, 6, 2, 381000),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 7, 5, 13, 6, 2, 381000)}
2016-07-05 21:06:02 [scrapy] INFO: Spider closed (finished)

dmoz.py is:
# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import TutorialItem


class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    strat_urls = ('http://www.dmoz.org/Computers/Programming/Languages/Python/Books/')

    def parse(self, response):
        lislink = response.xpath('/html/body/div[5]/div/section[3]/div/div/div[*]/div[3]/a')
        for li in lislink:
            item = TutorialItem()
            item['link'] = li.xpath('@href').extract()
            yield item

items.py is:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()

However, when I debug in the Scrapy shell, I can get the URLs:
D:\pythonweb\scrapy\test2>scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
2016-07-05 21:06:40 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-07-05 21:06:40 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-07-05 21:06:40 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-07-05 21:06:40 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-05 21:06:40 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-05 21:06:40 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-05 21:06:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-07-05 21:06:40 [scrapy] INFO: Spider opened
2016-07-05 21:06:42 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x03BF0E30>
[s] item {}
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] settings <scrapy.settings.Settings object at 0x03BF05F0>
[s] spider <DefaultSpider 'default' at 0x432b1d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>> lislink = response.xpath('/html/body/div[5]/div/section[3]/div/div/div[*]/div[3]/a')
>>> lislink.xpath('@href').extract()
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html', u'http://www.brpreiss.com/books/opus7/html/book.html', u'http://www.diveintopython.net/', u'http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/', u'http://www.techbooksforfree.com/perlpython.shtml', u'http://www.freetechbooks.com/python-f6.html', u'http://greenteapress.com/thinkpython/', u'http://www.network-theory.co.uk/python/intro/', u'http://www.freenetpages.co.uk/hp/alan.gauld/', u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html', u'http://hetland.org/writing/practical-python/', u'http://sysadminpy.com/', u'http://www.qtrac.eu/py3book.html', u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html', u'https://www.packtpub.com/python-3-object-oriented-programming/book', u'http://www.network-theory.co.uk/python/language/', u'http://www.pearsonhighered.com/educator/academic/product/0,,0130409561,00%2Ben-USS_01DBC.html', u'http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1', u'http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html', u'http://www.informit.com/store/product.aspx?isbn=0672317354', u'http://gnosis.cx/TPiP/', u'http://www.informit.com/store/product.aspx?isbn=0130211192']
>>>
Here is my platform:
Scrapy : 1.1.0
lxml : 3.6.0.0
libxml2 : 2.9.0
Twisted : 16.2.0
Python : 2.7.12 (v2.7.12:d33e0cf91556, Jun 27 2016, 15:19:22) [MSC v.1500 32 bit (Intel)]
pyOpenSSL : 16.0.0 (OpenSSL 1.0.2h 3 May 2016)
Platform : Windows-10-10.0.10586
Answer 0 (score: 1)
It is not strat_urls, it is start_urls, and it must be an iterable (usually a list) whose items are the URLs.
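As a minimal sketch of the fix (assuming the rest of the spider stays exactly as in the question), the corrected attribute would look like:

import scrapy
from tutorial.items import TutorialItem


class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    # start_urls (not strat_urls) -- an iterable of URL strings
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
    ]

    # ... parse(self, response) unchanged from the question ...

Because the misspelled strat_urls attribute is never read, Scrapy falls back to the default empty start_urls, makes no requests, and closes the spider after crawling 0 pages, which matches the log above. Note also that ('http://...') without a trailing comma is just a parenthesized string, not a tuple; a string is iterated character by character, so a list (or a one-element tuple written ('http://...',)) is the safe form even once the name is fixed.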