I'm trying to scrape a website for a specific piece of HTML and export the data to a CSV file. The exported data contains escape sequences and character codes, and each cell is wrapped in [u'']. Here is some sample exported data:
[u'<td colspan="2"><b><big>Universal Universal<br>3 \xbd" ID. to 4"OD. Adapter T409<br><br></big></b><table cellpadding="0" cellspacing="0" style="width: 300px; float:\nright; margin-right: 5px; border: 0px white solid; text-align:\ncenter;"><tr><td style="text-align: center;"><a href="products/images/med/UA1007.jpg" rel="thumbnail" title="UA1007"><img src="products/images/thumbs/UA1007.jpg" width="300px" align="right" style="border: 5px outset #333333;"></a></td></tr><tr><td style="text-align: center;"><table cellpadding="0" cellspacing="0" style="border: 0px solid white; width:\n300px; margin-left: auto; margin-right: auto;"><tr><td style="width: 33%; text-align: center;"></td><td style="width: 34%; text-align: center;"></td><td style="width: 33%; text-align: center;"></td></tr><tr><td></td><td></td><td></td></tr></table></td></tr></table>UA1007<br>\n3 1/2" ID to 4" OD, 7" Length <br>\nFits all pickup models<br><br>\nNow you can hook-up to your MBRP 4" and 5" hardware no matter what size your system. This adaptor is built from T409 stainless steel.<br><br><table><tr></tr></table></td>']
Here is the code for my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from MBRP.items import MbrpItem

class MBRPSpider(BaseSpider):
    name = "MBRP"
    allowed_domains = ["mbrpautomotive.com"]
    start_urls = [
        "http://www.mbrpautomotive.com/?page=products&part=B1410"
        # that's just one of the URLs, I have way more in this list
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        items = []
        for site in sites:
            item = MbrpItem()
            item['desc'] = site.select('//td[@colspan="2"]').extract()
            item['PN'] = site.select('//b/big/a').extract()
            items.append(item)
        return items
And here is the pipeline code I'm using:
import csv

class MBRPExporter(object):
    def __init__(self):
        self.MBRPCsv = csv.writer(open('output.csv', 'wb'))
        self.MBRPCsv.writerow(['desc', 'PN'])

    def process_item(self, item, spider):
        self.MBRPCsv.writerow([item['desc'], item['PN']])
        return item
I also tried pipeline code like the following, thinking that encoding to UTF-8 would help, but it gives me the error exceptions.AttributeError: 'XPathSelectorList' object has no attribute 'encode'.
import csv

class MBRPExporter(object):
    def __init__(self):
        self.MBRPCsv = csv.writer(open('output.csv', 'wb'))
        self.MBRPCsv.writerow(['desc', 'PN'])

    def process_item(self, item, spider):
        self.MBRPCsv.writerow([item['desc'].encode('utf-8'), item['PN'].encode('utf-8')])
        return item
Am I right in thinking that I need to export as UTF-8? If so, how do I go about it? Or is there some other way to clean up the exported data?
Answer 0 (score: 0)
You don't need to encode the CSV output unless whatever consumes it requires it. The extract() method produces a list (an XPathSelectorList):
site.select('//td[@colspan="2"]').extract()
and you can't call encode() on a list. You can either join the list, or take the first element, before returning the item:
item = MbrpItem()
item['desc'] = ' '.join(site.select('//td[@colspan="2"]').extract())
item['PN'] = site.select('//b/big/a').extract()[0]
items.append(item)
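If the CSV consumer does need UTF-8, encoding works once each field is a single string rather than a list. A minimal standalone illustration with made-up field values (no Scrapy required):

```python
# extract() returns a list of unicode strings; a list has no encode(),
# which is exactly the AttributeError seen above.
parts = [u'UA1007', u'3 \xbd" ID to 4" OD']

# Join the list into one string first, then encode for the csv writer.
joined = u' '.join(parts)
encoded = joined.encode('utf-8')  # now a UTF-8 byte string
```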
Or you can use Item Loaders:
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, TakeFirst

def parse(self, response):
    l = XPathItemLoader(response=response, item=MbrpItem())
    l.add_xpath('desc', '//td[@colspan="2"]', Join(' '))
    l.add_xpath('PN', '//b/big/a', TakeFirst())
    return l.load_item()
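The Join and TakeFirst processors are just callables applied to the list of extracted values. Conceptually they behave like the stand-ins below (a sketch for illustration, not Scrapy's actual source):

```python
class Join(object):
    """Concatenate the extracted values with a separator."""
    def __init__(self, separator=u' '):
        self.separator = separator

    def __call__(self, values):
        return self.separator.join(values)

class TakeFirst(object):
    """Return the first non-empty extracted value."""
    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value

# With these, the list from extract() collapses to a single string,
# so a downstream pipeline can safely call .encode('utf-8') on it.
desc = Join(u' ')([u'<td>part one</td>', u'<td>part two</td>'])
pn = TakeFirst()([u'', u'UA1007'])
```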