I am currently running two spiders to scrape a single page. The spiders are shown below as Header and Detail. I have set it up this way because I don't know how to write the start of the query (in this case, the variable named listings) so that I can grab //div[@class='patio-head'] and then //div[@class='patio-details'] in a single step. Can someone help me? For each URL I would like to return //div[@class='patio-details'] together with all of the corresponding details. Thanks!
Header
Name
Detail
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from PatioDetail.items import PatioItem


class MySpider(BaseSpider):
    name = "PDSHeader"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["patios.blogto.com"]
    start_urls = [
        "http://patios.blogto.com/patio/25-liberty-toronto/",
        "http://patios.blogto.com/patio/3030-dundas-west-toronto/",
        "http://patios.blogto.com/patio/3-speed/",
        "http://patios.blogto.com/patio/7numbers/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select("//div[@class='patio-head']")
        items = []
        for listing in listings:
            item = PatioItem()
            item["Name"] = listing.select("div[@class='patio-head-details']/div[@class='patio-name']/h2[@class='name']/text()").extract()
            items.append(item)
        return items
Answer 0 (score: 1)
The two sections you want are on the same page. All you need to do is fetch the page once and parse it for both sections' data, instead of fetching it twice and parsing it twice.
Before writing the spider, you should spend some time analyzing the structure of the pages you want to scrape.
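One convenient way to do that exploration (a general suggestion, not part of the original answer) is the Scrapy shell, which fetches a page and lets you try XPath expressions interactively before they go into parse():

# Open an interactive shell on one of the start URLs (old-style Scrapy,
# matching the BaseSpider / HtmlXPathSelector API used in the question):
#
#   scrapy shell "http://patios.blogto.com/patio/25-liberty-toronto/"
#
# Inside the shell, hxs is an HtmlXPathSelector bound to the fetched page,
# so candidate expressions can be tested one at a time:
hxs.select("//div[@class='patio-name']/h2/text()").extract()
hxs.select("//ul[@class='detail-lister']/li[@class='type-icon']//span[@class='detail-desc']/text()").extract()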
A code example follows:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item = PatioItem()

    # The name lives in the patio-head section of the page
    item['Name'] = hxs.select("//div[@class='patio-name']/h2/text()").extract()[0]

    # The remaining fields all sit in the detail-lister list: locate the
    # <li> for each field first, then read its detail-desc text relative to it
    node_type = hxs.select("//ul[@class='detail-lister']/li[@class='type-icon']")
    item['Type'] = node_type.select(".//span[@class='detail-desc']/text()").extract()[0]

    node_covered = hxs.select("//ul[@class='detail-lister']/li[@class='covered-icon']")
    item['Covered'] = node_covered.select(".//span[@class='detail-desc']/text()").extract()[0]

    node_heated = hxs.select("//ul[@class='detail-lister']/li[@class='heated-icon']")
    item['Heated'] = node_heated.select(".//span[@class='detail-desc']/text()").extract()[0]

    node_capacity = hxs.select("//ul[@class='detail-lister']/li[@class='capacity-icon last']")
    item['Capacity'] = node_capacity.select(".//span[@class='detail-desc']/text()").extract()[0]

    return [item]
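Note that this parse() assigns Type, Covered, Heated and Capacity, while the PatioItem shown in the question only uses Name. If those fields are not already declared, the item class in PatioDetail/items.py would need something like the following (a minimal sketch based on the fields used above, not code from the original answer):

from scrapy.item import Item, Field

class PatioItem(Item):
    # Fields referenced by the combined parse() above; any other fields the
    # project already defines can stay alongside these.
    Name = Field()
    Type = Field()
    Covered = Field()
    Heated = Field()
    Capacity = Field()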
Here is a tutorial on XPath. It should help you :)
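As a quick, self-contained illustration of the two XPath styles used in the answer (an absolute query from the document root versus a relative ".//" query scoped to a previously selected node), here is a small sketch using lxml against a hand-written fragment. The HTML below is an assumption modelled on the class names seen above, not the real page:

# Standalone XPath illustration (lxml, not Scrapy) of the patterns above.
from lxml import html

snippet = """
<div class="patio-head">
  <div class="patio-head-details">
    <div class="patio-name"><h2 class="name">25 Liberty</h2></div>
  </div>
</div>
<ul class="detail-lister">
  <li class="type-icon"><span class="detail-desc">Rooftop</span></li>
</ul>
"""

doc = html.fromstring(snippet)

# Absolute query from the document root, like the item['Name'] line.
print(doc.xpath("//div[@class='patio-name']/h2/text()"))            # ['25 Liberty']

# Scope to one node first, then query relative to it with ".//",
# like the node_type / detail-desc pair in the answer.
node_type = doc.xpath("//ul[@class='detail-lister']/li[@class='type-icon']")[0]
print(node_type.xpath(".//span[@class='detail-desc']/text()"))      # ['Rooftop']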