I am new to Scrapy. I have thousands of (url, xpath) tuples and values in a database. These urls come from different domains (not always; the same domain can have around 100 urls).
x.com/a //h1
y.com/a //div[@class='1']
z.com/a //div[@href='...']
x.com/b //h1
x.com/c //h1
...
Now I would like to fetch these values every 2 hours as quickly as possible, while making sure I do not overload any of these domains.
I cannot figure out how to do this.
My idea:
I could create one Spider for every different domain, set its parsing rules and run them all at once.
Is that good practice?
EDIT: I am not sure how to output the data into a database with respect to concurrency.
EDIT2:
I could do it like this - a new spider for every domain. But that is not feasible with thousands of different urls and xpaths.
import scrapy

class WikiScraper(scrapy.Spider):
    name = "wiki_headers"

    def start_requests(self):
        urls = [
            'https://en.wikipedia.org/wiki/Spider',
            'https://en.wikipedia.org/wiki/Data_scraping',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Wikipedia keeps the page title in an <h1>
        header = response.xpath('//h1/text()').extract()
        print(header)
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
class CraigslistScraper(scrapy.Spider):
    name = "craigslist_headers"

    def start_requests(self):
        urls = [
            'https://columbusga.craigslist.org/act/6062657418.html',
            'https://columbusga.craigslist.org/acc/6060297390.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # craigslist puts the listing title into a dedicated span
        header = response.xpath('//span[@id="titletextonly"]/text()').extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
Answer 0 (score: 1)
From the example you posted in edit2, it looks like all your classes can easily be abstracted by one more level. How about this:
import scrapy
from urllib.parse import urlparse

class GenericScraper(scrapy.Spider):
    def __init__(self, urls, xpath, **kwargs):
        # the name must exist before Spider.__init__ runs, otherwise
        # Scrapy raises "must have a name"
        self.name = self._create_scraper_name_from_url(urls[0])
        super().__init__(**kwargs)
        self.urls = urls
        self.xpath = xpath

    def _create_scraper_name_from_url(self, url):
        '''Generate scraper name from url
        www.example.com/foobar/bar -> www_example_com'''
        netloc = urlparse(url).netloc
        return netloc.replace('.', '_')

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        header = response.xpath(self.xpath).extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
Next, you can group the data from your database by xpath:
for urls, xpath in grouped_data:
    scraper = GenericScraper(urls, xpath)
    # do whatever you need with the scraper
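One way to actually run them is Scrapy's CrawlerProcess, which expects the spider class plus constructor keyword arguments rather than a ready-made instance. A minimal sketch, assuming grouped_data is the iterable of (urls, xpath) pairs from your database; the settings shown are just one way to keep the load on each site down:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # stay gentle with every individual site
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
    'DOWNLOAD_DELAY': 1,
})
for urls, xpath in grouped_data:
    # crawl() takes the class and the kwargs to construct it with
    process.crawl(GenericScraper, urls=urls, xpath=xpath)
process.start()  # blocks until all spiders have finished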
Ad concurrency: your database should be able to handle concurrent writes, so I do not see a problem there.
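If you want Scrapy to handle the writes for you, one option is to have parse() yield items and push them into the database from an item pipeline instead of appending to result.txt. A minimal sqlite3 sketch (the table and column names are made up for illustration); it would be enabled through the ITEM_PIPELINES setting:

import sqlite3

class DatabaseWriterPipeline:
    def open_spider(self, spider):
        # one connection per spider; sqlite serializes the writes itself
        self.conn = sqlite3.connect('results.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS results (url TEXT, value TEXT)')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO results (url, value) VALUES (?, ?)',
            (item['url'], item['value']))
        self.conn.commit()
        return item

With this, parse() would yield {'url': response.url, 'value': header[0]} instead of writing to the file.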
EDIT: Related to the timeouts: I do not know how Scrapy works under the hood, i.e. whether it uses some sort of parallelization and whether it runs asynchronously in the background. But from what you wrote I guess it does, and when you fire up 1k scrapers, each making multiple requests at a time, your hardware cannot handle that much traffic (disclaimer, this is just a guess!).
There may be a native way to do this, but a possible workaround is to use multiprocessing + a queue:
from multiprocessing import JoinableQueue, Process

NUMBER_OF_CPU = 4  # change this to your number
SENTINEL = None

class Worker(Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        while True:
            # blocking wait! You have to use sentinels if you use blocking waits!
            item = self.queue.get()
            if item is SENTINEL:
                # we got the sentinel, there are no more scrapers to process
                self.queue.task_done()
                return
            # item is a scraper, run it
            item.run_spider()  # or however you run your scrapers
            # This assumes that each scraper is **not** running in the background!
            # Tell the JoinableQueue we have processed one more item.
            # In the main process queue.join() waits until task_done() has been
            # called once for every item taken from the queue.
            self.queue.task_done()

def run():
    queue = JoinableQueue()
    # if putting that many things in the queue gets slow (I imagine
    # it can) you can fire up a separate Thread/Process to fill the
    # queue in the background while the workers are already consuming it
    for urls, xpath in grouped_data:
        scraper = GenericScraper(urls, xpath)
        queue.put(scraper)
    for sentinel in range(NUMBER_OF_CPU):
        # None or a sentinel of your choice to tell the workers there are
        # no more scrapers to process
        queue.put(SENTINEL)
    workers = []
    for _ in range(NUMBER_OF_CPU):
        worker = Worker(queue)
        workers.append(worker)
        worker.start()
    # We have to wait until the queue is processed
    queue.join()
But keep in mind that this is a vanilla approach to parallel execution which completely ignores Scrapy's own abilities. I found this blogpost which uses twisted to achieve (I think) the same thing. But since I have never used twisted, I cannot comment on it.
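For reference, running many spiders inside one Twisted reactor is roughly what the Scrapy documentation itself shows for this; a sketch, reusing grouped_data and GenericScraper from above:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()
for urls, xpath in grouped_data:
    runner.crawl(GenericScraper, urls=urls, xpath=xpath)
# stop the reactor once every crawl is done
runner.join().addBoth(lambda _: reactor.stop())
reactor.run()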
Answer 1 (score: 0)
If you are thinking that Scrapy cannot handle multiple domains at once because of the allowed_domains parameter, remember that it is optional.
If the allowed_domains parameter is not set on the spider, it can work with every domain it gets.
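So a single spider can mix domains freely. A minimal sketch; the custom_settings shown here are only one possible way to avoid hammering any single site, not something this answer prescribes:

import scrapy

class MultiDomainSpider(scrapy.Spider):
    name = "multi_domain"
    # no allowed_domains, so requests to any domain are allowed
    start_urls = [
        'https://en.wikipedia.org/wiki/Spider',
        'https://columbusga.craigslist.org/act/6062657418.html',
    ]
    custom_settings = {
        # keep the per-site load low
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'DOWNLOAD_DELAY': 1,
    }

    def parse(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}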
Answer 2 (score: -1)
If I understand correctly, you have domains mapped to xpath values, and you want to pull the right xpath depending on the domain you are crawling?
Try something like:
import logging

DOMAIN_DATA = [('domain.com', '//div')]

def get_domain(url):
    # return the xpath registered for the first domain contained in the url
    for domain, xpath in DOMAIN_DATA:
        if domain in url:
            return xpath

def parse(self, response):
    xpath = get_domain(response.url)
    if not xpath:
        logging.error('no xpath for url: {}; unknown domain'.format(response.url))
        return
    item = dict()
    item['some_field'] = response.xpath(xpath).extract()
    yield item
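One possible way to wire this into a spider (a sketch; the spider name and the example urls are placeholders, and in practice DOMAIN_DATA and the url list would both come from your database):

import scrapy

class DomainXpathSpider(scrapy.Spider):
    name = "domain_xpath"  # placeholder name

    def start_requests(self):
        # in practice these urls come from the same database rows as DOMAIN_DATA
        for url in ['http://domain.com/a', 'http://domain.com/b']:
            yield scrapy.Request(url, callback=self.parse)

    # the parse() shown above becomes a method of this class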