How do I scrape multiple pages from a website?

Date: 2017-12-08 19:34:41

Tags: python web-scraping scrapy

(Very) new to Python and programming in general.

I have been trying to use Scrapy to scrape data from several pages/sections of the same website.

My code works, but it is unreadable and impractical:

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'some'
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/Python/?k=books&p=1',
        'https://example.com/Python/?k=books&p=2',
        'https://example.com/Python/?k=books&p=3',
        'https://example.com/Python/?k=tutorials&p=1',
        'https://example.com/Python/?k=tutorials&p=2',
        'https://example.com/Python/?k=tutorials&p=3',
    ]

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()

        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }

            yield scraped_info

How can I improve this?

I would like to search across a given set of categories and a range of page numbers.

I need something like:

categories = [books, tutorials, a, b, c, d, e, f]
in a range(1,3)

so that Scrapy would do its job across all categories and pages, while staying easy to edit and adapt to other websites.

Any idea is welcome.

What I have tried:

import itertools

categories = ["books", "tutorials"]
base = "https://example.com/Python/?k={category}&p={index}"

def url_generator():
    for category, index in itertools.product(categories, range(1, 4)):
        yield base.format(category=category, index=index)

But Scrapy returns:

[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), 
scraped 0 items (at 0 items/min)
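
Presumably nothing is crawled because url_generator() is never called: Scrapy only schedules requests taken from start_urls or yielded by start_requests(), and neither references the generator. A minimal sketch of wiring it in (assuming the categories, base and url_generator from the snippet above):

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'some'
    allowed_domains = ['example.com']

    def start_requests(self):
        # turn each generated URL into a scheduled request
        for url in url_generator():
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # placeholder; the real extraction logic goes here
        yield {'url': response.url}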

2 Answers:

Answer 0: (score: 1)

Solved thanks to start_requests() and yield scrapy.Request().

Here is the code:

import scrapy
import itertools


class SomeSpider(scrapy.Spider):
    name = 'somespider'
    allowed_domains = ['example.com']

    def start_requests(self):
        categories = ["books", "tutorials"]
        base = "https://example.com/Python/?k={category}&p={index}"

        for category, index in itertools.product(categories, range(1, 4)):
            yield scrapy.Request(base.format(category=category, index=index))

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()

        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }

            yield scraped_info
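
If this spider file lives inside a Scrapy project, one common way to run it and export the scraped items (the output filename here is just an example) is:

scrapy crawl somespider -o output.json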

Answer 1: (score: 0)

You can use the method start_requests() to generate the starting urls, using yield Request(url).

BTW: later, in parse(), you can also add new urls with yield Request(url).
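
A sketch of that pattern (not part of this answer's original code; the CSS selectors are assumptions about quotes.toscrape.com's markup):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # scrape the items on the current page
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}

        # then queue the next listing page, if there is one
        next_href = response.css('li.next a::attr(href)').extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href))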

I use the portal toscrape.com, which was created for testing spiders.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    allowed_domains = ['quotes.toscrape.com']

    #start_urls = []

    tags = ['love', 'inspirational', 'life', 'humor', 'books', 'reading']
    pages = 3
    url_template = 'http://quotes.toscrape.com/tag/{}/page/{}'

    def start_requests(self):

        for tag in self.tags:
            for page in range(1, self.pages + 1):  # the site's page numbers start at 1
                url = self.url_template.format(tag, page)
                yield scrapy.Request(url)


    def parse(self, response):
        # test if method was executed
        print('url:', response.url)

# --- run it without project ---

from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess({
#    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
#    'FEED_FORMAT': 'csv',
#    'FEED_URI': 'output.csv',
#})

c = CrawlerProcess()
c.crawl(MySpider)
c.start()