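I am trying to download data from gsmarena. Sample code for downloading the HTC One ME specifications from "http://www.gsmarena.com/htc_one_me-7275.php" is given below.
The data on the site is organized into tables and table rows. The data format is: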
table header > td[@class='ttl'] > td[@class='nfo']
Items.py file:
import scrapy


class gsmArenaDataItem(scrapy.Item):
    phoneName = scrapy.Field()
    phoneDetails = scrapy.Field()
Spider file:
from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem


class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # extract whatever stuffs you want and yield items here
        hxs = Selector(response)
        phone = gsmArenaDataItem()
        tableRows = hxs.css("div#specs-list table")
        for tableRows in tableRows:
            phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0]
            for ttl in tableRows.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                colonSign = ": "
                commaSign = ", "
                seq = [ttl_value, colonSign, nfo_value, commaSign]
                seq = seq.join(seq)
        phone['phoneDetails'] = seq
        yield phone
However, the code gives an error when run:
File "C:\Users\ajhavery\Desktop\gsmarena_data\gsmarena_data\spiders\test.py", line 26, in parse
sequenceNew = sequenceNew.join(seq)
exceptions.MemoryError:
The idea is to get the data in the following format:
Table row header: respective data, Table row header: respective data, ....
For example:
Network Technology: GSM / HSPA / LTE, Network 2G bands: GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2, Battery Stand-by: Up to 598 h (2G) / Up to 626 h (3G), Battery Talk time: Up to 23 h (2G) / Up to 13 h (3G),
Update 1:
Using the code suggested by @alecxe:
def parse(self, response):
    # extract whatever stuffs you want and yield items here
    phone = gsmArenaDataItem()
    details = []
    for tableRows in response.css("div#specs-list table"):
        phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0]
        for ttl in tableRows.xpath(".//td[@class='ttl']"):
            ttl_value = " ".join(ttl.xpath(".//text()").extract())
            nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
            details.append('{title}: {info}'.format(title=ttl_value, info=nfo_value))
    phone['phoneDetails'] = ", ".join(details)
    yield phone
gives the error:
File "C:\Users\ajhavery\Desktop\gsmarena_data\gsmarena_data\spiders\test.py", line 22, in parse
details.append('{title}: {info}'.format(title=ttl_value, info=nfo_value))
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
Answer 0 (score: 1)
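Collect the phone details in a list and join() them after the loop. Encoding each value as UTF-8 before formatting avoids the UnicodeEncodeError caused by the non-breaking space (u'\xa0'):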
def parse(self, response):
    phone = gsmArenaDataItem()
    details = []
    for tableRows in response.css("div#specs-list table"):
        phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0]
        for ttl in tableRows.xpath(".//td[@class='ttl']"):
            ttl_value = " ".join(ttl.xpath(".//text()").extract())
            nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
            details.append('{title}: {info}'.format(title=ttl_value.encode("utf-8"), info=nfo_value.encode("utf-8")))
    phone['phoneDetails'] = ", ".join(details)
    yield phone
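
As a side note, the collect-then-join step can be tried out without Scrapy. The sketch below is only an illustration under a few assumptions: `format_details` is a hypothetical helper name (not part of Scrapy or of the code above), the input is a plain list of (title, info) text pairs, and rows whose title cell contains only a non-breaking space are emitted without the "title: " prefix. It keeps everything as Unicode text instead of calling `.encode("utf-8")`, which is probably the safer choice on Python 3, where formatting encoded bytes into a string would yield `b'...'` literals.

```python
def format_details(rows):
    """Join (title, info) pairs into 'title: info, title: info'.

    Hypothetical helper for illustration; `rows` is assumed to be an
    iterable of (title, info) text pairs already extracted from the page.
    """
    details = []
    for title, info in rows:
        # Replace non-breaking spaces (u'\xa0') with plain spaces, then
        # collapse any runs of whitespace into a single space.
        title = " ".join(title.replace(u"\xa0", " ").split())
        info = " ".join(info.replace(u"\xa0", " ").split())
        # Assumption: a blank title cell means "no label", so only the
        # info text is kept for that row.
        details.append(u"{}: {}".format(title, info) if title else info)
    # Build the final string once, after the loop.
    return u", ".join(details)


rows = [
    (u"Technology", u"GSM / HSPA / LTE"),
    (u"\xa0", u"Up to 598 h (2G)"),
]
print(format_details(rows))
```

Inside the spider, the pairs would come from the same `ttl` / `nfo` XPath expressions used above, and the returned string would be assigned to `phone['phoneDetails']`.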