Running Scrapy on a set of 100+ URLs

Asked: 2015-05-31 13:50:37

Tags: scrapy, scrapy-spider

I need to download the CPU and GPU data for a set of mobile phones from gsmarena. As a first step, I ran Scrapy to download the URLs of those phones and dropped the unnecessary items.

The code for that is below.

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from gsmarena_data.items import gsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "mobile_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        # 'http://www.gsmarena.com/samsung-phones-f-9-10.php',
        # 'http://www.gsmarena.com/apple-phones-48.php',
        # 'http://www.gsmarena.com/microsoft-phones-64.php',
        # 'http://www.gsmarena.com/nokia-phones-1.php',
        # 'http://www.gsmarena.com/sony-phones-7.php',
        # 'http://www.gsmarena.com/lg-phones-20.php',
        # 'http://www.gsmarena.com/htc-phones-45.php',
        # 'http://www.gsmarena.com/motorola-phones-4.php',
        # 'http://www.gsmarena.com/huawei-phones-58.php',
        # 'http://www.gsmarena.com/lenovo-phones-73.php',
        # 'http://www.gsmarena.com/xiaomi-phones-80.php',
        # 'http://www.gsmarena.com/acer-phones-59.php',
        # 'http://www.gsmarena.com/asus-phones-46.php',
        # 'http://www.gsmarena.com/oppo-phones-82.php',
        # 'http://www.gsmarena.com/blackberry-phones-36.php',
        # 'http://www.gsmarena.com/alcatel-phones-5.php',
        # 'http://www.gsmarena.com/xolo-phones-85.php',
        # 'http://www.gsmarena.com/lava-phones-94.php',
        # 'http://www.gsmarena.com/micromax-phones-66.php',
        # 'http://www.gsmarena.com/spice-phones-68.php',
        'http://www.gsmarena.com/gionee-phones-92.php',
    )

    def parse(self, response):
        hxs = Selector(response)
        phone_listings = hxs.css('.makers')
        for listing in phone_listings:
            # create a fresh item per listing so each yield is independent
            phone = gsmArenaDataItem()
            phone['model'] = listing.xpath("ul/li/a/strong/text()").extract()
            phone['link'] = listing.xpath("ul/li/a/@href").extract()
            yield phone

Now I need to run Scrapy on those URLs to fetch the CPU and GPU data, all of which sits under the CSS selector ".ttl".

Please advise how to loop Scrapy over the set of URLs and write the output to a single CSV or JSON file. I already know how to create items and use CSS selectors; what I need help with is looping over these hundreds of pages.

I have a list of urls like:

www.gsmarena.com/samsung_galaxy_s5_cdma-6338.php
www.gsmarena.com/samsung_galaxy_s5-6033.php
www.gsmarena.com/samsung_galaxy_core_lte_g386w-6846.php
www.gsmarena.com/samsung_galaxy_core_lte-6099.php
www.gsmarena.com/acer_iconia_one_8_b1_820-7217.php
www.gsmarena.com/acer_iconia_tab_a3_a20-7136.php
www.gsmarena.com/microsoft_lumia_640_dual_sim-7082.php
www.gsmarena.com/microsoft_lumia_532_dual_sim-6951.php

Which are the links to phone descriptions on gsm arena.

Now I need to download the CPU and GPU info of the 100 models I have.

I extracted the URLs of those 100 models for which the data is required.

The spider written for this is:

    from scrapy import Spider
    from scrapy.selector import Selector
    from gsmarena_data.items import gsmArenaDataItem


    class MobileInfoSpider(Spider):
        name = "cpu_gpu_info"
        allowed_domains = ["gsmarena.com"]
        start_urls = (
            "http://www.gsmarena.com/microsoft_lumia_435_dual_sim-6949.php",
            "http://www.gsmarena.com/microsoft_lumia_435-6942.php",
            "http://www.gsmarena.com/microsoft_lumia_535_dual_sim-6792.php",
            "http://www.gsmarena.com/microsoft_lumia_535-6791.php",
        )

        def parse(self, response):
            hxs = Selector(response)
            cpu_gpu = hxs.css('.ttl')
            for row in cpu_gpu:  # originally looped over an undefined name
                phone = gsmArenaDataItem()
                phone['cpu'] = row.xpath("ul/li/a/strong/text()").extract()
                phone['gpu'] = row.xpath("ul/li/a/@href").extract()
                yield phone

If I can somehow run the spider over the URLs I want to extract this data from, I can get all the required data into one CSV file.
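One way to feed such a list to a spider is to load it from a file at startup. Note that the bare URLs listed above lack a scheme, and Scrapy needs absolute URLs. The helper below is a minimal sketch (the file name `urls.txt` and the function name are hypothetical, not from the question): it reads one URL per line, skips blanks, and prepends `http://` where missing, producing a list that can be assigned to `start_urls`.

```python
def load_start_urls(path):
    """Read one URL per line; skip blank lines; prepend http:// when missing."""
    urls = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if not line.startswith(("http://", "https://")):
                line = "http://" + line  # bare host/path -> absolute URL
            urls.append(line)
    return urls
```

With `start_urls = load_start_urls("urls.txt")` in the spider, the whole crawl can then be written to a single file using Scrapy's built-in feed export flag, e.g. `scrapy crawl cpu_gpu_info -o cpu_gpu.csv` (or `-o cpu_gpu.json`).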

1 Answer:

Answer 0 (score: 0)

I think you need the information from every vendor. If so, you don't have to add those hundreds of URLs to start_urls; instead you can use this link as the start_url, and from there you can extract the phone-page URLs programmatically and process whatever you want.

This answer will help you do that.
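The answer's suggestion of extracting the phone-page URLs programmatically can be sketched without the Scrapy machinery. The helper below is a hypothetical illustration using only the standard library: given the HTML of a maker page, it pulls the relative phone-page links and turns them into absolute URLs. In a real spider you would instead use the response's selectors and yield a new Request per link with a second callback that scrapes the `.ttl` fields.

```python
import re

def extract_phone_links(html, base="http://www.gsmarena.com/"):
    """Pull relative phone-page hrefs (ending in .php) out of maker-page HTML.

    Illustrative only: a crude regex stands in for a proper selector such as
    Scrapy's response.css('.makers a::attr(href)').
    """
    links = re.findall(r'href="([\w\-]+\.php)"', html)
    return [base + link for link in links]
```

Each URL returned this way would be fed back to the crawler (in Scrapy, `yield Request(url, callback=self.parse_phone)`), so the hundreds of pages are discovered and followed automatically rather than hard-coded.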