I'm not sure why my script isn't working, even though the same selectors work in the scrapy shell. I want to parse the listed columns and have the script output the data to an external JSON file.
I've tested in the scrapy shell and get successful results; however, my script fails.
Scrapy shell test:
scrapy shell https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1
>>> response
<200 https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1>
>>> table = response.xpath('//*[@class="wikitable sortable zebra"]//tr')
>>> table.xpath('td//text()')[3].extract()
u' pile_of_chocobo_bedding '
Where the script fails:
import scrapy

class BootstrapTableSpider(scrapy.Spider):
    name = "bootstrap_table"

    def start_requests(self):
        urls = [
            'https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="wikitable sortable zebra"]//tr'):
            yield {
                'id': row.xpath('td//text()')[0].extract(),
                'name': row.xpath('td//text()')[3].extract(),
                'stackable': row.xpath('td//text()')[5].extract(),
                'category': row.xpath('td//text()')[9].extract(),
                'vendor_price': row.xpath('td//text()')[11].extract()
            }
The goal is for the data to be parsed and exported to a JSON file.
Answer 0 (score: 0)
It fails on the first row of the table, which holds the headers. That tr contains only th elements and no td, which is why the error is IndexError: list index out of range. To avoid this, simply skip rows that yield no td data, like this:
import scrapy

class BootstrapTableSpider(scrapy.Spider):
    name = "bootstrap_table"
    start_urls = ['https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1']

    def parse(self, response):
        for row in response.xpath('//*[@class="wikitable sortable zebra"]//tr'):
            data = row.xpath('td//text()').extract()
            if not data:  # pay attention how we skip the empty (header) row here
                continue
            yield {
                'id': data[0],
                'name': data[3],
                'stackable': data[5],
                'category': data[9],
                'vendor_price': data[11]
            }
Output:
...
2019-04-30 08:48:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-30 08:48:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1> (referer: None) ['cached']
2019-04-30 08:48:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1>
{'category': u' #N/A ', 'stackable': u' 1 ', 'vendor_price': u' 198\n', 'id': u' 1 ', 'name': u' pile_of_chocobo_bedding '}
2019-04-30 08:48:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1>
{'category': u' Furnishings ', 'stackable': u' 1 ', 'vendor_price': u' 391\n', 'id': u' 2 ', 'name': u' simple_bed '}
2019-04-30 08:48:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1>
{'category': u' Furnishings ', 'stackable': u' 1 ', 'vendor_price': u' 1403\n', 'id': u' 3 ', 'name': u' oak_bed '}
2019-04-30 08:48:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1>
{'category': u' Furnishings ', 'stackable': u' 1 ', 'vendor_price': u' 10100\n', 'id': u' 4 ', 'name': u' mahogany_bed '}
2019-04-30 08:48:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1>
{'category': u' Furnishings ', 'stackable': u' 1 ', 'vendor_price': u' 1564\n', 'id': u' 5 ', 'name': u' bronze_bed '}
2019-04-30 08:48:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1>
{'category': u' Furnishings ', 'stackable': u' 1 ', 'vendor_price': u' 12406\n', 'id': u' 6 ', 'name': u' nobles_bed '}
2019-04-30 08:48:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wiki.dspt.info/index.php/Basic_Item_IDs_Page_1>
{'category': u' #N/A ', 'stackable': u' 1 ', 'vendor_price': u' 0\n', 'id': u' 7 ', 'name': u' gold_bed '}
...
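One thing the log above makes visible is that the scraped values still carry the wiki table's padding and trailing newlines (e.g. u' 198\n'). A minimal, framework-free sketch of the same skip-and-clean idea; the sample rows and the clean_cells helper here are illustrative, not part of the original spider:

```python
def clean_cells(cells):
    """Strip the leading/trailing whitespace seen in the raw wiki cells."""
    return [c.strip() for c in cells]

# Simulated rows: a header <tr> yields no td//text() nodes at all,
# so its extracted list is empty and should be skipped.
rows = [
    [],  # header row -> skipped
    [' 1 ', '', '', ' pile_of_chocobo_bedding ', '', ' 1 ',
     '', '', '', ' #N/A ', '', ' 198\n'],
]

items = []
for raw in rows:
    if not raw:  # same guard as in the spider above
        continue
    data = clean_cells(raw)
    items.append({
        'id': data[0],
        'name': data[3],
        'stackable': data[5],
        'category': data[9],
        'vendor_price': data[11],
    })

print(items[0]['vendor_price'])  # '198' -- no stray whitespace or newline
```

To get the external JSON file the question asks for, no extra code should be needed: scrapy's built-in feed export (scrapy crawl bootstrap_table -o items.json) writes the yielded items out directly.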