Question

I am trying to download data from the gsmarena page "http://www.gsmarena.com/htc_one_me-7275.php".

However, the data is organized into tables and table rows. The data format is:

table header > td[@class='ttl'] > td[@class='nfo']

Edited code: thanks to help from Stack Exchange community members, I have reformatted the code as follows. The items.py file:

import scrapy

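class gsmArenaDataItem(scrapy.Item):
    # Text of the table header (th) for a spec block
    phoneName = scrapy.Field()
    # "title: value" text built from the td.ttl / td.nfo cells
    phoneDetails = scrapy.Field()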

The spider file:

from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem

class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # Each table under div#specs-list is one spec block (Network, Battery, ...).
        for table in response.css("div#specs-list table"):
            # Create a fresh item per table so earlier blocks are not overwritten.
            phone = gsmArenaDataItem()
            phone['phoneName'] = table.xpath(".//th/text()").extract()[0]
            details = []
            for ttl in table.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append(ttl_value + ": " + nfo_value)
            # Join every "title: value" pair collected for this block.
            phone['phoneDetails'] = ", ".join(details)
            yield phone

However, as soon as I try to load the page in the scrapy shell using the following, I get banned:

"http://www.gsmarena.com/htc_one_me-7275.php"

I even tried setting DOWNLOAD_DELAY = 3 in settings.py.

Please suggest what I should do.

Answer 1

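The idea is to iterate over all table elements inside "specs-list", get the th element for the block name, then get every td element with class="ttl" and its corresponding following sibling td with class="nfo".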

Demo from the shell:

In [1]: for scope in response.css("div#specs-list table"):
            scope_name = scope.xpath(".//th/text()").extract()[0]

            for ttl in scope.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())

                print scope_name, ttl_value, nfo_value
   ....:     
Network Technology GSM / HSPA / LTE
Network 2G bands GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2
...
Battery Stand-by Up to 598 h (2G) / Up to 626 h (3G)
Battery Talk time Up to 23 h (2G) / Up to 13 h (3G)
Misc Colors Meteor Grey, Rose Gold, Gold Sepia
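
The same grouping can be tried offline, without hitting the site. This is only a sketch: it swaps Scrapy selectors for the standard library's xml.etree.ElementTree (which has no following-sibling axis, so the ttl/nfo pair is looked up per row instead), and SAMPLE is hypothetical, simplified markup modelled on the th / td.ttl / td.nfo layout rather than the real page. The helper name group_specs is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may not parse as XML.
SAMPLE = """
<div id="specs-list">
  <table>
    <tr><th>Network</th><td class="ttl">Technology</td><td class="nfo">GSM / HSPA / LTE</td></tr>
  </table>
  <table>
    <tr><th>Battery</th><td class="ttl">Stand-by</td><td class="nfo">Up to 598 h (2G) / Up to 626 h (3G)</td></tr>
    <tr><td class="ttl">Talk time</td><td class="nfo">Up to 23 h (2G) / Up to 13 h (3G)</td></tr>
  </table>
</div>
"""

def group_specs(markup):
    """Return {block name: {title: value}} for every table in the markup."""
    root = ET.fromstring(markup)
    specs = {}
    for table in root.iter("table"):
        # The th cell carries the block name (Network, Battery, ...).
        block = "".join(table.find(".//th").itertext()).strip()
        rows = specs.setdefault(block, {})
        for row in table.iter("tr"):
            ttl = row.find("td[@class='ttl']")
            nfo = row.find("td[@class='nfo']")
            if ttl is None or nfo is None:
                continue  # skip rows without a title/value pair
            rows["".join(ttl.itertext()).strip()] = "".join(nfo.itertext()).strip()
    return specs

specs = group_specs(SAMPLE)
for block, rows in specs.items():
    for title, value in rows.items():
        print(block, title, value)
```

Keeping the result as a nested dict rather than one joined string makes it easier to export each block as its own field later.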

Answer 2

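I faced the same problem of getting banned within a few requests. Rotating proxies with scrapy-proxies and enabling autothrottling helped significantly, but did not solve the problem completely.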

You can find my code at gsmarenacrawler.
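
For reference, a settings.py fragment along these lines combines a fixed delay, Scrapy's built-in AutoThrottle extension, and scrapy-proxies. Treat it as a rough sketch: the numeric values are illustrative assumptions, not tuned for gsmarena.com, the proxy list path is a placeholder, and the scrapy-proxies keys follow that project's README as I remember it, so check them against the version you install.

```python
# settings.py (sketch; values are illustrative assumptions)

# Fixed base delay between requests, as tried in the question.
DOWNLOAD_DELAY = 3

# Built-in AutoThrottle extension: adjusts the delay based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# Retry responses that often mean a proxy was blocked or overloaded.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# scrapy-proxies: send requests through proxies read from a text file.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}
PROXY_LIST = "/path/to/proxy/list.txt"  # placeholder path, one proxy URL per line
PROXY_MODE = 0  # assumed from the README: a different proxy for every request
```

Even with all of this enabled, a site can still ban the crawler, which matches the experience described above.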

Extracting data from a gsmarena page using Scrapy
