I am trying to learn how web scraping works using Python. I am attempting to use Scrapy to collect some data from the Morningstar website. Basically, I want the program to read my CSV file, which contains a row of Morningstar URLs, and then parse the "Other share class information" table on each Morningstar page. My problem is that I keep getting: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). Any help would be appreciated.
morningSpider.py
import scrapy
from scrapy.spiders import Spider, Rule
from scrapy.linkextractors import LinkExtractor
from .. import items
from scrapy.http import Request
import csv
def get_urls_from_csv():
    # Raw string so the backslashes in the Windows path are not treated as escape sequences.
    with open(r"C:\Users\kate\Desktop\morningStar\morningTest.csv", 'r') as f:
        data = csv.reader(f)
        scrapurls = []
        for row in data:
            for column in row:
                scrapurls.append(column)
    return scrapurls
class morningSpider(Spider):
    name = "morningSpider"
    allowed_domains = []
    #start_urls = scrapurls

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item',
             follow=True),
    )

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]

    def parse(self, response):
        # Rows of the "Other share class information" table; [1:] skips the header row.
        for row in response.xpath('//table[@class="r_table1 text2"]//table/tr')[1:]:
            item = items.MorningItem()
            item['FundName'] = row.xpath('td[2]/text()').extract()[0]
            item['FrontLoad'] = row.xpath('td[3]/text()').extract()[0]
            item['DeferredLoad'] = row.xpath('td[4]/text()').extract()[0]
            item['ExpenseRatio'] = row.xpath('td[5]/text()').extract()[0]
            item['MinInitPurchase'] = row.xpath('td[6]/text()').extract()[0]
            item['Actual'] = row.xpath('td[7]/text()').extract()[0]
            item['PurchaseConstraint'] = row.xpath('td[8]/text()').extract()[0]
            item['ShareClassAttributes'] = row.xpath('td[9]/text()').extract()[0]
            yield item
items.py
from scrapy.item import Item, Field

class MorningItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    FundName = Field()
    FrontLoad = Field()
    DeferredLoad = Field()
    ExpenseRatio = Field()
    MinInitPurchase = Field()
    Actual = Field()
    PurchaseConstraint = Field()
    ShareClassAttributes = Field()
Output
2015-11-29 18:46:26 [scrapy] INFO: Scrapy 1.0.3 started (bot: morningScrape)
2015-11-29 18:46:26 [scrapy] INFO: Optional features available: ssl, http11
2015-11-29 18:46:26 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'morningScrape.spiders', 'SPIDER_MODULES':
['morningScrape.spiders'], 'BOT_NAME': 'morningScrape'}
2015-11-29 18:46:26 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-29 18:46:26 [scrapy] INFO: Enabled downloader middleware:
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware,
RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware,
HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2015-11-29 18:46:26 [scrapy] INFO: Enabled spider middlewares:
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware,
UrlLengthMiddleware, DepthMiddleware
2015-11-29 18:46:26 [scrapy] INFO: Enabled item pipelines:
2015-11-29 18:46:26 [scrapy] INFO: Spider opened
2015-11-29 18:46:26 [scrapy] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2015-11-29 18:46:26 [scrapy] DEBUG: Telnet console listening on
2015-11-29 18:46:26 [scrapy] DEBUG: Filtered duplicate request: <GET
SOMEURL> - no more duplicates will be shown (see DUPEFILTER_DEBUG to
show all duplicates)
2015-11-29 18:46:27 [scrapy] DEBUG: Crawled (200) <GET
ANOTHERURL> (referer: None)
2015-11-29 18:46:27 [scrapy] INFO: Closing spider (finished)
2015-11-29 18:46:27 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 267,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 8691,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 29, 23, 46, 27, 255000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 29, 23, 46, 26, 848000)}
2015-11-29 18:46:27 [scrapy] INFO: Spider closed (finished)
Answer 0 (score: 0)
I don't know if you should be sharing those URLs, but checking the URLs in your log I can see that the first XPath isn't fetching any items, and the second URL isn't publicly accessible.

Then, you are overriding the parse method, which CrawlSpider needs in order to apply the rules, but you are using Spider, which doesn't use rules, so they are never applied to your crawl.

Also, if you change Spider to CrawlSpider, the rule can't contain both follow=True and callback != None, because they are at odds with each other.
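A minimal sketch of that change (not part of the original answer; the CSV path, the table XPath and the r'Items/' pattern are carried over from the question purely as placeholders) could look like this:

import csv
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from .. import items

class MorningCrawlSpider(CrawlSpider):
    # Sketch only: keeps CrawlSpider's built-in parse() intact and moves the
    # table parsing into the rule callback, as the answer suggests.
    name = "morningCrawlSpider"

    rules = (
        # A callback is set and follow is left at its default, per the point
        # above about follow=True and callback being at odds.
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item'),
    )

    def start_requests(self):
        # No callback here, so CrawlSpider's own parse() handles these
        # responses and the rules get applied to them.
        with open(r"C:\Users\kate\Desktop\morningStar\morningTest.csv") as f:
            for row in csv.reader(f):
                for url in row:
                    yield scrapy.Request(url)

    def parse_item(self, response):
        # Same table XPath as in the question, skipping the header row.
        for row in response.xpath('//table[@class="r_table1 text2"]//table/tr')[1:]:
            item = items.MorningItem()
            item['FundName'] = row.xpath('td[2]/text()').extract_first()
            item['FrontLoad'] = row.xpath('td[3]/text()').extract_first()
            # ...remaining columns as in the question...
            yield item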
Answer 1 (score: 0)
As @eLRuLL's answer says, you should use scrapy.spiders.CrawlSpider.

The "0 pages" line by itself is not the problem. Look at dupefilter/filtered in the log stats: it means your URLs were filtered out as duplicates. See the DUPEFILTER_CLASS setting in scrapy.settings.default_settings, and override that setting to prevent the filtering.
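A minimal sketch of that override (my assumption, not code from the original answer): in the project's settings.py, point DUPEFILTER_CLASS at the no-op filter, or just turn on the DUPEFILTER_DEBUG setting that the crawl log already mentions.

# settings.py -- sketch of disabling or debugging the duplicate filter.
# BaseDupeFilter does no filtering, so every request is scheduled.
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'

# Alternatively, keep the default filter but log every filtered request.
DUPEFILTER_DEBUG = True

An alternative that avoids touching settings at all is to pass dont_filter=True to the Requests built in start_requests, which exempts just those start URLs from deduplication.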