I'm trying to scrape the names and addresses of restaurants from the http://www.just-eat.co.uk/belfast-takeaway page. So far, my CSV output has all the names in one row and all the addresses in another row. I'm trying to get one row per name and one row per address.
Here is my spider:
import scrapy

from justeat.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["just-eat.co.uk"]
    start_urls = ["http://www.just-eat.co.uk/belfast-takeaway", ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="searchResults"]'):
            item = DmozItem()
            item['name'] = sel.xpath('//*[@itemprop="name"]').extract()
            item['address'] = sel.xpath('//*[@class="address"]').extract()
            yield item
Here is my items file:
import scrapy

class DmozItem(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
I then run my code with:
scrapy crawl dmoz -o items.csv
Can anyone put me on the right track with my code?
Answer 0 (score: 1)
Here you go :)
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy

from justeat.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["just-eat.co.uk"]
    start_urls = ["http://www.just-eat.co.uk/belfast-takeaway", ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="searchResults"]'):
            names = sel.xpath('//*[@itemprop="name"]/text()').extract()
            names = [name.strip() for name in names]

            addresses = sel.xpath('//*[@class="address"]/text()').extract()
            addresses = [address.strip() for address in addresses]

            result = zip(names, addresses)
            for name, address in result:
                item = DmozItem()
                item['name'] = name
                item['address'] = address
                yield item