Scrapy problem: Crawled 0 pages (at 0 pages/min)

Asked: 2015-11-30 00:13:22

Tags: python scrapy

I am trying to learn how web scraping works using Python. I am using Scrapy to collect some data from the Morningstar website. Basically, I want the program to read my CSV file, which contains a row of Morningstar URLs, and then parse the "other class information" table on each Morningstar page. My problem is that I keep getting: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). Any help would be appreciated.

morningSpider.py

import scrapy
from scrapy.spiders import Spider, Rule
from scrapy.linkextractors import LinkExtractor
from .. import items
from scrapy.http import Request
import csv

def get_urls_from_csv():
    # raw string so the backslashes in the Windows path are not treated as escape sequences
    with open(r"C:\Users\kate\Desktop\morningStar\morningTest.csv", 'r') as f:
        data = csv.reader(f)
        scrapurls = []
        for row in data:
            for column in row:
                scrapurls.append(column)
        return scrapurls

class morningSpider(Spider):
    name = "morningSpider"
    allowed_domains = []
    #start_urls = scrapurls

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]

    def parse(self, response):
        for sel in response.xpath('//table[@class="r_table1 text2"]//table/tr')[1:]:
            item = items.MorningItem()  # instantiate the item class defined in items.py
            item['FundName'] = sel.xpath("td[2]/text()").extract()[0]
            item['FrontLoad'] = sel.xpath("td[3]/text()").extract()[0]
            item['DeferredLoad'] = sel.xpath("td[4]/text()").extract()[0]
            item['ExpenseRatio'] = sel.xpath("td[5]/text()").extract()[0]
            item['MinInitPurchase'] = sel.xpath("td[6]/text()").extract()[0]
            item['Actual'] = sel.xpath("td[7]/text()").extract()[0]
            item['PurchaseConstraint'] = sel.xpath("td[8]/text()").extract()[0]
            item['ShareClassAttributes'] = sel.xpath("td[9]/text()").extract()[0]
            yield item

items.py

from scrapy.item import Item, Field

class MorningItem(Item):
    # define the fields for your item here
    FundName = Field()
    FrontLoad = Field()
    DeferredLoad = Field()
    ExpenseRatio = Field()
    MinInitPurchase = Field()
    Actual = Field()
    PurchaseConstraint = Field()
    ShareClassAttributes = Field()

Output

2015-11-29 18:46:26 [scrapy] INFO: Scrapy 1.0.3 started (bot: morningScrape)
2015-11-29 18:46:26 [scrapy] INFO: Optional features available: ssl, http11
2015-11-29 18:46:26 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'morningScrape.spiders', 'SPIDER_MODULES': ['morningScrape.spiders'], 'BOT_NAME': 'morningScrape'}
2015-11-29 18:46:26 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-29 18:46:26 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-29 18:46:26 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-29 18:46:26 [scrapy] INFO: Enabled item pipelines:
2015-11-29 18:46:26 [scrapy] INFO: Spider opened
2015-11-29 18:46:26 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-29 18:46:26 [scrapy] DEBUG: Telnet console listening on
2015-11-29 18:46:26 [scrapy] DEBUG: Filtered duplicate request: <GET SOMEURL> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2015-11-29 18:46:27 [scrapy] DEBUG: Crawled (200) <GET ANOTHERURL> (referer: None)
2015-11-29 18:46:27 [scrapy] INFO: Closing spider (finished)
2015-11-29 18:46:27 [scrapy] INFO: Dumping Scrapy stats:  
{'downloader/request_bytes': 267,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 8691,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 29, 23, 46, 27, 255000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 29, 23, 46, 26, 848000)}
2015-11-29 18:46:27 [scrapy] INFO: Spider closed (finished)

2 Answers:

Answer 0 (score: 0)

I don't know whether you should be sharing those URLs, but checking the URLs from the log, I see that the first one returns no items for your XPath, and the second one is not publicly accessible.
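If you want to confirm that yourself, a quick check (the URL below is only a placeholder for one of the CSV entries, not taken from the original post) is to try the table XPath in the Scrapy shell:

scrapy shell "http://PLACEHOLDER-MORNINGSTAR-URL"
>>> response.xpath('//table[@class="r_table1 text2"]//table/tr')
[]   # an empty SelectorList means the expression matches nothing on that page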

Also, you are overriding the parse method, which CrawlSpider needs in order to apply its rules, but your spider extends Spider, which does not use rules at all, so they are never applied to your crawl.

Furthermore, if you do change Spider to CrawlSpider, a rule cannot have both follow=True and a callback != None, since those options work against each other.
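Putting that advice together, here is a rough sketch (untested against the live site) of how the spider could be restructured as a CrawlSpider. The CSV path, the allow pattern, and the XPaths are carried over from the question as-is; the parsing callback is renamed to parse_item so CrawlSpider's built-in parse can apply the rules, and the rule keeps only the callback (no follow=True) as suggested above.

import csv

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from .. import items


def get_urls_from_csv():
    with open(r"C:\Users\kate\Desktop\morningStar\morningTest.csv", 'r') as f:
        # flatten every column of every row into one list of URLs
        return [column for row in csv.reader(f) for column in row]


class MorningSpider(CrawlSpider):
    name = "morningSpider"

    # Rules only take effect because the class inherits from CrawlSpider
    # and no longer overrides parse().
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item'),
    )

    def start_requests(self):
        for url in get_urls_from_csv():
            yield scrapy.Request(url)

    def parse_item(self, response):
        for sel in response.xpath('//table[@class="r_table1 text2"]//table/tr')[1:]:
            item = items.MorningItem()
            item['FundName'] = sel.xpath("td[2]/text()").extract()[0]
            # ... remaining fields exactly as in the original parse() ...
            yield item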

Answer 1 (score: 0)

  1. As @eLRuLL's answer says, you should use scrapy.spiders.CrawlSpider.

  2. The "Crawled 0 pages" message itself doesn't matter.

     Look at dupefilter/filtered in the log: it means your URLs were filtered out as duplicates. See the DUPEFILTER_CLASS setting in scrapy.settings.default_settings and override it to prevent the filtering (a sketch follows below).
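Two common ways to act on that advice (both are sketches, not taken from the original post; pick whichever fits):

# Option 1: tell the scheduler not to dupe-filter the start requests at all.
def start_requests(self):
    for url in get_urls_from_csv():
        yield scrapy.Request(url, dont_filter=True)

# Option 2 (in settings.py): replace the default dupe filter with the no-op
# BaseDupeFilter so no request is ever filtered as a duplicate.
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'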