Broad crawl - different xpaths - Scrapy

Asked: 2017-03-28 22:26:13

Tags: python web-scraping scrapy web-crawler

I am new to Scrapy. I have thousands of (url, xpath) tuples and values in a database. The urls come from many different domains (though not entirely: the same domain may account for around 100 urls).

x.com/a //h1
y.com/a //div[@class='1']
z.com/a //div[@href='...']
x.com/b //h1
x.com/c //h1
...

Now I want to fetch these values every 2 hours, as fast as possible, while making sure I do not overload any of these domains.

I can't figure out how to do this.

My idea:

I could create one Spider for every distinct domain, set its parsing rules, and run them all at once.

Is this good practice?

EDIT: I am not sure how to output the data into my database with respect to concurrency.

EDIT2:

I could do it like this: a new spider for every domain. But that is not feasible with thousands of different urls and xpaths.

import scrapy

class WikiScraper(scrapy.Spider):
    name = "wiki_headers"

    def start_requests(self):
        urls = [
            'https://en.wikipedia.org/wiki/Spider',
            'https://en.wikipedia.org/wiki/Data_scraping',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # response.xpath replaces the deprecated HtmlXPathSelector
        header = response.xpath('//h1/text()').extract()
        print(header)
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

class CraigslistScraper(scrapy.Spider):
    name = "craigslist_headers"

    def start_requests(self):
        urls = [
            'https://columbusga.craigslist.org/act/6062657418.html',
            'https://columbusga.craigslist.org/acc/6060297390.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        header = response.xpath('//span[@id="titletextonly"]/text()').extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

3 Answers:

Answer 0 (score: 1)

From the example you posted in EDIT2, it looks like all of your classes can easily be abstracted one level. How about this?

from urllib.parse import urlparse

import scrapy

class GenericScraper(scrapy.Spider):
    def __init__(self, urls, xpath):
        # Set the name before calling Spider.__init__, which requires it.
        self.name = self._create_scraper_name_from_url(urls[0])
        super().__init__()
        self.urls = urls
        self.xpath = xpath

    @staticmethod
    def _create_scraper_name_from_url(url):
        '''Generate scraper name from url
           www.example.com/foobar/bar -> www_example_com'''
        netloc = urlparse(url).netloc
        return netloc.replace('.','_')

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # response.xpath replaces the deprecated HtmlXPathSelector
        header = response.xpath(self.xpath).extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

Next, you could group the data from your database by xpath:

for urls, xpath in grouped_data:
    scraper = GenericScraper(urls, xpath)
    # do whatever you need with scraper
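
How grouped_data gets built depends on your schema, but as a rough sketch, assuming the rows come out of the database as (url, xpath) pairs (both the rows variable and its layout below are assumptions), you could bucket them with a defaultdict:

from collections import defaultdict

# Hypothetical rows fetched from the database as (url, xpath) pairs.
rows = [
    ('http://x.com/a', '//h1'),
    ('http://y.com/a', "//div[@class='1']"),
    ('http://x.com/b', '//h1'),
]

buckets = defaultdict(list)
for url, xpath in rows:
    buckets[xpath].append(url)

# Same shape as grouped_data above: one (urls, xpath) pair per distinct xpath.
grouped_data = [(urls, xpath) for xpath, urls in buckets.items()]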

Regarding concurrency: your database should be able to handle concurrent writes, so I do not see a problem there.
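
If you would rather keep the writes inside Scrapy instead of appending to result.txt, an item pipeline is the usual place for them. The sketch below is only an illustration, assuming SQLite and items carrying hypothetical url and value fields:

import sqlite3

class SQLiteWriterPipeline:
    # Hypothetical pipeline: collects scraped values and writes them to SQLite.

    def open_spider(self, spider):
        self.conn = sqlite3.connect('results.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS results (url TEXT, value TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO results (url, value) VALUES (?, ?)',
                          (item['url'], item['value']))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

You would enable it via ITEM_PIPELINES in the settings and have the parse callbacks yield items instead of writing to a file.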

EDIT: Regarding the timeouts: I do not know how Scrapy works under the hood, i.e. whether it uses some sort of parallelization and whether it runs asynchronously in the background. But from what you wrote I guess it does, and when you fire up 1k scrapers, each issuing multiple requests at a time, your hardware cannot handle that much traffic (disclaimer: this is just a guess!).

There may be a native way to do this, but a possible workaround is to use multiprocessing + a queue:

from multiprocessing import JoinableQueue, Process

NUMBER_OF_CPU = 4  # change this to your number.
SENTINEL = None


class Worker(Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        while True:
            # Blocking wait! You have to use sentinels if you use blocking waits!
            item = self.queue.get()
            if item is SENTINEL:
                # We got the sentinel: there are no more scrapers to process.
                self.queue.task_done()
                return
            # item is a scraper, run it
            item.run_spider()  # or however you run your scrapers
            # This assumes that each scraper is **not** running in the background!

            # Tell the JoinableQueue we have processed one more item.
            # In the main process, queue.join() waits until task_done() has
            # been called once for every item taken from the queue.
            self.queue.task_done()


def run():
    queue = JoinableQueue()
    # If putting that many things in the queue gets slow (I imagine it can),
    # you can fire up a separate Thread/Process to fill the queue in the
    # background while the workers are already consuming it.
    for urls, xpath in grouped_data:
        scraper = GenericScraper(urls, xpath)
        queue.put(scraper)
    for _ in range(NUMBER_OF_CPU):
        # None, or a sentinel of your choice, to tell the workers there are
        # no more scrapers to process
        queue.put(SENTINEL)
    workers = []
    for _ in range(NUMBER_OF_CPU):
        worker = Worker(queue)
        workers.append(worker)
        worker.start()

    # We have to wait until the queue is fully processed
    queue.join()

But bear in mind that this is a vanilla approach that completely disregards Scrapy's own capabilities for parallel execution. I found This blogpost, which uses Twisted to achieve (I think) the same thing, but since I have never used Twisted I cannot comment on it.
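
For what it's worth, one native-looking route is Scrapy's CrawlerProcess, which schedules many spider instances on a single Twisted reactor. The sketch below reuses the GenericScraper and grouped_data from above; treat it as an untested outline:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # Politeness knobs so thousands of urls do not hammer any single domain.
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
    'AUTOTHROTTLE_ENABLED': True,
})

for urls, xpath in grouped_data:
    # Keyword arguments are forwarded to the spider's __init__.
    process.crawl(GenericScraper, urls=urls, xpath=xpath)

process.start()  # blocks until all scheduled crawls have finished

Whether a single process copes with thousands of spider instances is something you would have to measure; batching the groups across a few processes is another option.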

Answer 1 (score: 0)

If you think Scrapy cannot handle multiple domains at once because of the allowed_domains parameter, remember that it is optional.

If the allowed_domains attribute is not set on the spider, it will work with every domain it is given.
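
A single spider fed urls from many domains can then lean on Scrapy's built-in per-domain throttling so that no individual site gets overloaded; a minimal settings sketch (the numbers are placeholders to tune):

# settings.py -- values are illustrative, tune them for your crawl
CONCURRENT_REQUESTS = 64            # overall parallelism across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # politeness cap per individual site
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True         # back off automatically when a site slows down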

Answer 2 (score: -1)

If I understand correctly, you have domains mapped to xpath values, and you want to pull the right xpath depending on the domain you are crawling?
Try something like:

import logging

DOMAIN_DATA = [('domain.com', '//div')]
def get_domain(url):
    for domain, xpath in DOMAIN_DATA:
        if domain in url: 
            return xpath


def parse(self, response):
    xpath = get_domain(response.url)
    if not xpath:
        logging.error('no xpath for url: {}; unknown domain'.format(response.url))
        return
    item = dict()
    item['some_field'] = response.xpath(xpath).extract()
    yield item
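
One caveat: the substring check above can misfire (for example, 'domain.com' also matches 'notdomain.com'). A slightly stricter lookup keyed on the parsed hostname might look like this (DOMAIN_XPATHS and xpath_for_url are made-up names):

from urllib.parse import urlparse

# Hypothetical mapping from hostname to xpath, loaded from the database.
DOMAIN_XPATHS = {
    'x.com': '//h1',
    'y.com': "//div[@class='1']",
}

def xpath_for_url(url):
    # Return the xpath registered for the url's hostname, or None if unknown.
    # Hostnames with a www. prefix or an explicit port would need extra normalization.
    hostname = urlparse(url).netloc.lower()
    return DOMAIN_XPATHS.get(hostname)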