Extracting 3 levels of content from paginated pages with Scrapy

Posted: 2015-04-08 14:15:26

Tags: scrapy

I have a seed URL (say DOMAIN/manufacturers.php), without pagination, that looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="st-text">
        <table cellspacing="6" width="600">
            <tr>
                <td>
                    <a href="manufacturer1-type-59.php"></a>
                </td>

                <td>
                    <a href="manufacturer1-type-59.php">Name 1</a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php"></a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php">Name 2</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer3-type-88.php"></a>
                </td>

                <td>
                    <a href="manufacturer3-type-88.php">Name 3</a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php"></a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php">Name 4</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer5-type-28.php"></a>
                </td>

                <td>
                    <a href="manufacturer5-type-28.php">Name 5</a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php"></a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php">Name 6</a>
                </td>
            </tr>
        </table>
    </div>
</body>
</html>

From there I want to get all the a['href'] values, e.g. manufacturer1-type-59.php. Note that these links do not include the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?

Optionally, I would also like to keep these links in memory (for the next stage) and save them to disk for future reference.
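In Scrapy, the usual way to add the prefix is to join each relative href against the page's own URL. A minimal sketch of this first level, assuming Python 2's urlparse module (urllib.parse on Python 3) and a hypothetical parse_manufacturer callback:

from urlparse import urljoin  # urllib.parse.urljoin on Python 3
from scrapy import Request

def parse(self, response):
    # every <a> inside the st-text table is a manufacturer link
    for href in response.xpath('//div[@class="st-text"]//a/@href').extract():
        # urljoin turns "manufacturer1-type-59.php" into an absolute DOMAIN/... URL
        yield Request(urljoin(response.url, href), callback=self.parse_manufacturer)

Keeping the links in memory happens implicitly here (every Request carries its URL, available later as response.url); for the disk copy you can append each joined URL to a file inside the same loop.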

The content behind each link (e.g. manufacturer1-type-59.php) looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="makers">
        <ul>
            <li>
                <a href="manufacturer1_model1_type1.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model1_type2.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model2_type3.php"></a>
            </li>
        </ul>
    </div>

    <div class="nav-band">
        <div class="nav-items">
            <div class="nav-pages">
                <span>Pages:</span><strong>1</strong>
                <a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
                <a href="manufacturer1-type-STRING-59-INT-p3.php">3</a>
                <a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a>
            </div>
        </div>
    </div>
</body>
</html>

Next, I want to get all the a['href'] values, e.g. manufacturer_model1_type1.php. Again, note that these links do not include the domain prefix. The additional difficulty here is that these pages are paginated, so I want to walk through all of those pages as well. As expected, manufacturer-type-59.php redirects to manufacturer-type-STRING-59-INT-p2.php.

Again, I would optionally like to keep these links in memory (for the next stage) and save them to disk for future reference.
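A minimal sketch of this second level, reusing the hypothetical imports and callback names from above; the pagination is handled by feeding the "Next page" link back into the same callback:

def parse_manufacturer(self, response):
    # the model links on the current page
    for href in response.xpath('//div[@class="makers"]/ul/li/a/@href').extract():
        yield Request(urljoin(response.url, href), callback=self.parse_model)

    # follow the pagination until no "Next page" anchor is left
    next_page = response.xpath('//div[@class="nav-pages"]/a[@title="Next page"]/@href').extract()
    if next_page:
        yield Request(urljoin(response.url, next_page[0]), callback=self.parse_manufacturer)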

The third and final step should be to retrieve the content of all pages of the manufacturer_model1_type1.php type, extract the title, and save the results to a file in the format: (url, title).
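A minimal sketch of that last step: an Item holding the two fields plus a callback that fills it (the class and field names are my own choice, and the XPath assumes the page's <title> element is the title you want):

import scrapy

class PhoneItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

def parse_model(self, response):
    item = PhoneItem()
    item['url'] = response.url
    title = response.xpath('//title/text()').extract()
    item['title'] = title[0].strip() if title else ''
    yield item

Running the spider with a feed export, e.g. scrapy crawl gsmarena -o results.csv, then writes the (url, title) pairs to disk without any extra pipeline code.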

EDIT

Here is what I have done so far, but it doesn't seem to work...

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ArchiveItem(scrapy.Item):
    url = scrapy.Field()

class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']
    rules = [
        # level 1: manufacturer pages (followed, no callback)
        Rule(LinkExtractor(allow=[r'\S+-phones-\d+\.php'])),
        # level 2: paginated manufacturer pages (followed, no callback)
        Rule(LinkExtractor(allow=[r'\S+-phones-f-\d+-0-\S+\.php'])),
        # level 3: individual phone pages
        Rule(LinkExtractor(allow=[r'\S+_\S+_\S+-\d+\.php']), callback='parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent

1 Answer:

Answer 0 (score: 2)

I think you are better off using a plain BaseSpider (Spider in recent Scrapy) instead of CrawlSpider.

This code may help:

from scrapy import Spider, Request

class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3', ]
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        # level 1: collect the manufacturer links from the makers page
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        for marker in markers:
            yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        # level 2: extract the phone urls on this (possibly paginated) page
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        for phone in phones:
            yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)

        # pagination: feed the "Next page" link back into this same callback
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # level 3: extract whatever fields you want and yield items here
        pass

EDIT

If you want to keep track of where these phone URLs came from, you can pass meta from parse through parse_marker to parse_phone; the Requests would then look like:

yield Request(url=self.BASE_URL + marker, callback=self.parse_marker, meta={'url_level1': response.url})

yield Request(url=self.BASE_URL + phone, callback=self.parse_phone, meta={'url_level2': response.url, 'url_level1': response.meta['url_level1']})
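With that in place, a minimal sketch of how parse_phone could read the whole chain back out of response.meta (assuming an item class that also declares url_level1/url_level2 fields; all of these names are hypothetical):

def parse_phone(self, response):
    item = PhoneItem()
    item['url_level1'] = response.meta['url_level1']  # the makers.php3 page
    item['url_level2'] = response.meta['url_level2']  # the manufacturer listing page
    item['url'] = response.url                        # the phone page itself
    title = response.xpath('//title/text()').extract()
    item['title'] = title[0].strip() if title else ''
    yield item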