I am trying to download data from the gsmarena page: "http://www.gsmarena.com/htc_one_me-7275.php".
However, the data is categorized into tables and table rows. The data is in the format:
table header > td[@class='ttl'] > td[@class='nfo']
Edit: thanks to the help of stackexchange community members, I have reformatted the code as follows. Items.py file:
import scrapy

class gsmArenaDataItem(scrapy.Item):
    phoneName = scrapy.Field()
    phoneDetails = scrapy.Field()
Spider file:
from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem

class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # extract whatever stuff you want and yield items here
        hxs = Selector(response)
        phone = gsmArenaDataItem()
        tableRows = hxs.css("div#specs-list table")
        for tableRow in tableRows:
            phone['phoneName'] = tableRow.xpath(".//th/text()").extract()[0]
            for ttl in tableRow.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                colonSign = ": "
                commaSign = ", "
                seq = [ttl_value, colonSign, nfo_value, commaSign]
                phone['phoneDetails'] = "".join(seq)
            yield phone
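One detail worth noting in the loop above: phone['phoneDetails'] is reassigned on every ttl/nfo pair, so only the last pair of each table survives. A minimal plain-Python sketch of accumulating all pairs instead (the sample data here is hypothetical, standing in for what the XPath extraction would return):

```python
# Hypothetical ttl/nfo pairs, as the XPath extraction might return them
rows = [
    ("Technology", "GSM / HSPA / LTE"),
    ("2G bands", "GSM 850 / 900 / 1800 / 1900"),
]

# Accumulate every "ttl: nfo" pair instead of overwriting a single field
details = []
for ttl_value, nfo_value in rows:
    details.append(ttl_value + ": " + nfo_value)
phone_details = ", ".join(details)
print(phone_details)  # Technology: GSM / HSPA / LTE, 2G bands: GSM 850 / 900 / 1800 / 1900
```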
However, as soon as I try to load the page in the scrapy shell using:
"http://www.gsmarena.com/htc_one_me-7275.php"
I get banned. I even tried using DOWNLOAD_DELAY = 3 in settings.py. Please suggest what I should do.
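For reference, a minimal settings.py sketch combining a download delay with a browser-like User-Agent; the exact values and the UA string are illustrative assumptions, not a guaranteed fix for the ban:

```python
# settings.py -- illustrative throttling settings (values are assumptions)
BOT_NAME = "gsmarena_data"

# Wait between requests to the same site
DOWNLOAD_DELAY = 3
# Vary the delay (0.5x to 1.5x of DOWNLOAD_DELAY) so requests look less robotic
RANDOMIZE_DOWNLOAD_DELAY = True

# Identify as a regular browser instead of the default Scrapy user agent
# (hypothetical UA string -- replace with a current one)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Keep concurrency low when hammering a single small site
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```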
Answer 0 (score: 1)
The idea is to iterate over all the table elements inside the "specs-list", get the th element for the block name, then get all the td elements with class="ttl" together with their corresponding td siblings with class="nfo".
Demo from the shell:
In [1]: for scope in response.css("div#specs-list table"):
            scope_name = scope.xpath(".//th/text()").extract()[0]
            for ttl in scope.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                print scope_name, ttl_value, nfo_value
   ....:
Network Technology GSM / HSPA / LTE
Network 2G bands GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2
...
Battery Stand-by Up to 598 h (2G) / Up to 626 h (3G)
Battery Talk time Up to 23 h (2G) / Up to 13 h (3G)
Misc Colors Meteor Grey, Rose Gold, Gold Sepia
Answer 1 (score: 0)
I also faced the same issue of getting banned after only a few requests. Changing proxies with scrapy-proxies and using autothrottling helped significantly, but did not completely solve the problem.
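As a sketch of what this answer describes, both the scrapy-proxies middleware and Scrapy's AutoThrottle are configured in settings.py; the proxy list path is a placeholder, and the middleware priorities follow the scrapy-proxies README:

```python
# settings.py -- sketch of rotating proxies + autothrottle (values are assumptions)

# AutoThrottle adapts the download delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# scrapy-proxies: pick a random proxy for each request
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}
PROXY_LIST = "/path/to/proxy/list.txt"  # placeholder path to a proxy list file
PROXY_MODE = 0  # 0 = random proxy per request

RETRY_TIMES = 5  # retry failed/banned requests a few extra times
```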
You can find my code at gsmarenacrawler.