How can my web crawler (Python, Scrapy, Scrapy-Splash) crawl faster?

Asked: 2019-03-15 05:46:34

Tags: mongodb scrapy web-crawler python-3.7 scrapy-splash

Development environment:

  • CentOS7
  • pip 18.1
  • Docker version 18.09.3, build 774a1f4
  • Anaconda command-line client (version 1.7.2)
  • Python 3.7
  • Scrapy 1.6.0
  • scrapy-splash
  • MongoDB (database version v4.0.6)
  • PyCharm

Server specs:

  • CPU -> processors: 22, vendor_id: GenuineIntel, cpu family: 6, model: 45, model name: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
  • RAM -> 31960 MB
  • 64-bit

Hello.

I am a PHP developer, and this is my first Python project. I am trying Python because I have heard it has many advantages for web scraping.

I am crawling a dynamic website, and I need to fetch 3,500 pages every 5 to 15 seconds. Right now my crawler is far too slow: it only crawls about 200 pages per minute.
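
To put the target in numbers: 3,500 pages every 5-15 seconds is roughly 230-700 pages per second, while 200 pages per minute is only about 3.3 pages per second, so I am off by a factor of about 70-200.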

My source looks like this:

main.py

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.bot1 import Bot1Spider
from spiders.bot2 import Bot2Spider
from spiders.bot3 import Bot3Spider
from spiders.bot4 import Bot4Spider
from pprint import pprint


# Run all four spiders in a single process, sharing the project settings.
process = CrawlerProcess(get_project_settings())
process.crawl(Bot1Spider)
process.crawl(Bot2Spider)
process.crawl(Bot3Spider)
process.crawl(Bot4Spider)
process.start()  # blocks until every spider has finished
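
(As I understand CrawlerProcess, all four spiders run concurrently in this one process and share the Twisted reactor; splitting the work across four bots is how I try to parallelize the page list.)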

bot1.py

import scrapy
import datetime
import math

from scrapy_splash import SplashRequest
from pymongo import MongoClient
from pprint import pprint

# NOTE: 'domain' is used in start_requests below but was never defined in my
# snippet; defining a placeholder here so the code runs (replace with the
# real target site).
domain = 'https://example.com'

class Bot1Spider(scrapy.Spider):
    name = 'bot1'
    client = MongoClient('localhost', 27017)
    db = client.db

    def start_requests(self):
        # Each bot is meant to take a quarter of the games collection;
        # this one reads the first 25% of the documents.
        count = int(self.db.games.find().count())
        num = math.floor(count * 0.25)
        start_urls = self.db.games.find().limit(num - 1)

        for url in start_urls:
            # Positional field access: value 5 is the page path,
            # value 0 is the Mongo _id.
            full_url = domain + list(url.values())[5]
            yield SplashRequest(full_url, self.parse, args={'wait': 0.1},
                                meta={'oid': list(url.values())[0]})

    def parse(self, response):
        pass
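
An aside on the data access: I know list(url.values())[5] depends on the field order of the stored documents and is fragile. Below is a sketch of the same start_requests using field names instead of positions; 'path' is a guessed field name for illustration, not the real schema (it drops into the class above, so the imports and 'domain' are the same):

    def start_requests(self):
        # Same 25% slice of the games collection as above.
        count = int(self.db.games.find().count())
        num = math.floor(count * 0.25)

        for doc in self.db.games.find().limit(num - 1):
            # Assumed field names: 'path' holds the page path, '_id' the Mongo id.
            full_url = domain + doc['path']
            yield SplashRequest(full_url, self.parse, args={'wait': 0.1},
                                meta={'oid': doc['_id']})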

settings.py

BOT_NAME = 'crawler'

SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'


# Scrapy Configuration

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'my-project-name (www.my.domain)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 64

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
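
For reference, here is a sketch of other Scrapy settings that, as far as I understand, affect throughput; the values are illustrative guesses, not benchmarked numbers:

# Hypothetical tuning values -- illustrative, not benchmarked.
CONCURRENT_REQUESTS = 128            # raise the global cap further
CONCURRENT_REQUESTS_PER_DOMAIN = 64  # all requests go to one site
REACTOR_THREADPOOL_MAXSIZE = 20      # more threads for DNS resolution
COOKIES_ENABLED = False              # skip cookie handling if not needed
RETRY_ENABLED = False                # do not retry failed pages
DOWNLOAD_TIMEOUT = 15                # give up on slow pages sooner
LOG_LEVEL = 'INFO'                   # less logging overhead than DEBUG

I also understand that with scrapy-splash the Splash service itself can become the bottleneck, since it renders every page in a headless browser; its capacity is controlled by the --slots option when the Splash container is started.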

To run this code, I use the following command: python main.py

Please take a look at my code and help me. I will gladly listen to any advice.

1. How can I make my spiders faster? I tried using threading, but it does not seem to work properly.

2. What is the best setup for high-performance web crawling?

3. Is it even possible to crawl 3,500 dynamic pages every 5-15 seconds?

Thank you.

0 Answers:

No answers yet.