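I'm trying to scrape a website and save the results as a formatted CSV file. I can save the file, but I have three questions about the output and its formatting:
All of the results end up in a single cell instead of on multiple rows. Did I forget a command when listing the items so that they come out as a list?
How do I remove the [u' that precedes each result? (I searched and found how to do it for print, but not for return.)
Is there a way to add text to certain item results? (For example, can I add "http://groupon.com" to the beginning of each deallink result?)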
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from deals.items import DealsItem

class DealsSpider(BaseSpider):
    name = "groupon.com"
    allowed_domains = ["groupon.com"]
    start_urls = [
        "http://www.groupon.com/chicago/all",
        "http://www.groupon.com/new-york/all"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="page_content clearfix"]')
        items = []
        for site in sites:
            item = DealsItem()
            item['deal1'] = site.select('//div[@class="c16_grid_8"]/a/@title').extract()
            item['deal1link'] = site.select('//div[@class="c16_grid_8"]/a/@href').extract()
            item['img1'] = site.select('//div[@class="c16_grid_8"]/a/img/@src').extract()
            item['deal2'] = site.select('//div[@class="c16_grid_8 last"]/a/@title').extract()
            item['deal2link'] = site.select('//div[@class="c16_grid_8 last"]/a/@href').extract()
            item['img2'] = site.select('//div[@class="c16_grid_8 last"]/a/img/@src').extract()
            items.append(item)
        return items
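Answer 0 (score: 2)
Edit: Now that I understand the problem better: should your parse() function look more like the following? That is, yield one item at a time rather than returning a list. I suspect that the list you are returning is what ends up mis-formatted into a single cell.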
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="page_content clearfix"]')
    for site in sites:
        item = DealsItem()
        item['deal1'] = site.select('//div[@class="c16_grid_8"]/a/@title').extract()
        item['deal1link'] = site.select('//div[@class="c16_grid_8"]/a/@href').extract()
        item['img1'] = site.select('//div[@class="c16_grid_8"]/a/img/@src').extract()
        item['deal2'] = site.select('//div[@class="c16_grid_8 last"]/a/@title').extract()
        item['deal2link'] = site.select('//div[@class="c16_grid_8 last"]/a/@href').extract()
        item['img2'] = site.select('//div[@class="c16_grid_8 last"]/a/img/@src').extract()
        yield item
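To see why the shape of each item matters for the CSV, here is a small sketch that uses only the standard library's csv module, not Scrapy's own exporter. The sample titles and links are made up, rows_from_lists and to_csv are hypothetical helpers, and the code assumes the extracted lists are equally long, in the same order, and that the hrefs are site-relative. It also shows one way to prepend the site address to each link.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "http://www.groupon.com"  # assumption: hrefs on the page are site-relative


def rows_from_lists(titles, links, base_url=BASE_URL):
    """Pair up parallel lists of extracted values, yielding one dict per deal."""
    for title, link in zip(titles, links):
        yield {"deal": title, "deallink": urljoin(base_url, link)}


def to_csv(rows, fieldnames=("deal", "deallink")):
    """Write an iterable of dicts to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data standing in for the .extract() results.
titles = ["Pizza for Two", "Yoga Classes"]
links = ["/deals/pizza-for-two", "/deals/yoga-classes"]

# One item holding whole lists: each list is stringified into a single cell.
one_cell = to_csv([{"deal": titles, "deallink": links}])

# One item per deal: each deal gets its own row.
one_row_per_deal = to_csv(rows_from_lists(titles, links))

print(one_cell)
print(one_row_per_deal)
```

The first printout has a single data row whose cells contain the bracketed lists; the second has one row per deal with full URLs.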
Answer 1 (score: 0)
See the Item Pipeline documentation: http://doc.scrapy.org/topics/item-pipeline.html
The u' prefix marks a unicode string: http://docs.python.org/howto/unicode.html
>>> s = 'foo'
>>> unicode(s)
u'foo'
>>> str(unicode(s))
'foo'
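The brackets in the output come from a different source than the u prefix: extract() returns a list, and it is the list that gets printed. Below is a rough sketch of a pipeline-style cleanup, written for Python 3 (where the u prefix no longer appears) and assuming items behave like dicts whose values are lists. The class name FlattenPipeline, the base URL, and the choice to keep only the first value are illustrative assumptions. process_item(self, item, spider) is the method Scrapy calls on a pipeline, but this class is plain Python so it runs without Scrapy installed.

```python
from urllib.parse import urljoin


class FlattenPipeline:
    """Hypothetical pipeline: turn extracted lists into plain strings."""

    base_url = "http://www.groupon.com"       # assumption: hrefs are site-relative
    link_fields = ("deal1link", "deal2link")  # field names taken from the question

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, (list, tuple)):
                # Keep the first extracted value; an empty list becomes "".
                value = value[0] if value else ""
            if key in self.link_fields and value:
                # Prepend the site address to relative links.
                value = urljoin(self.base_url, value)
            item[key] = value
        return item


# Made-up sample item standing in for what the spider yields.
raw = {"deal1": ["Pizza for Two"], "deal1link": ["/deals/pizza-for-two"], "img1": []}

print(str(raw["deal1"]))  # the list's repr, brackets and quotes included
clean = FlattenPipeline().process_item(dict(raw), spider=None)
print(clean["deal1"])     # the bare string
```

In a real project the class would be listed in the ITEM_PIPELINES setting; whether taking only the first value is appropriate depends on how many matches each XPath returns.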