How can I use Scrapy to parse two different parts of the source code and merge the results?

Asked: 2013-12-11 15:20:04

Tags: python web-scraping scrapy

I am currently running two spiders to scrape a single page. The spiders are shown below as Header and Detail. I set things up this way because I do not know how to set the start of the query (the variable named listings in this case) so that I can grab //div[@class='patio-head'] and then //div[@class='patio-details'] in a single step. Could someone help me out, as for each URL I would like to return the header information along with all of the corresponding details? Thanks!

Header

Name

Detail

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from PatioDetail.items import PatioItem

class MySpider(BaseSpider):
    name = "PDSHeader"
    allowed_domains = ["patios.blogto.com"]
    start_urls = ["http://patios.blogto.com/patio/25-liberty-toronto/",
                  "http://patios.blogto.com/patio/3030-dundas-west-toronto/",
                  "http://patios.blogto.com/patio/3-speed/",
                  "http://patios.blogto.com//patio/7numbers/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One node per patio header block on the page
        listings = hxs.select("//div[@class='patio-head']")
        items = []
        for listing in listings:
            item = PatioItem()
            item["Name"] = listing.select("div[@class='patio-head-details']/div[@class='patio-name']/h2[@class='name']/text()").extract()
            items.append(item)
        return items
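
For reference, the Detail spider mentioned above might have looked something like the following minimal sketch; the spider name PDSDetail, the //div[@class='patio-details'] starting point and the detail-desc selectors are assumptions based on the question text and the answer below, not code confirmed by the original poster:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from PatioDetail.items import PatioItem

class MyDetailSpider(BaseSpider):
    # "PDSDetail" is a hypothetical name; the question only calls this spider "Detail"
    name = "PDSDetail"
    allowed_domains = ["patios.blogto.com"]
    start_urls = ["http://patios.blogto.com/patio/25-liberty-toronto/"]  # same URL list as the Header spider

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        # Start from the details section named in the question; the detail-desc
        # spans are taken from the answer below and assumed to sit inside this div
        for details in hxs.select("//div[@class='patio-details']"):
            item = PatioItem()
            item["Type"] = details.select(".//li[@class='type-icon']//span[@class='detail-desc']/text()").extract()
            item["Capacity"] = details.select(".//li[@class='capacity-icon last']//span[@class='detail-desc']/text()").extract()
            items.append(item)
        return items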

1 Answer:

Answer 0 (score: 1):

The two sections you want are on the same page. All you need to do is fetch the page once and parse it for both sections' data, rather than fetching it twice and parsing it twice.
Before writing a spider, you should spend some time analyzing the structure of the page you want to scrape.

A code example follows:

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    item = PatioItem()
    # Patio name from the header section of the page
    item['Name'] = hxs.select("//div[@class='patio-name']/h2/text()").extract()[0]
    # Each detail is an <li> inside ul.detail-lister; select the <li> first,
    # then read the text of its detail-desc span
    node_type = hxs.select("//ul[@class='detail-lister']/li[@class='type-icon']")
    item['Type'] = node_type.select(".//span[@class='detail-desc']/text()").extract()[0]
    node_covered = hxs.select("//ul[@class='detail-lister']/li[@class='covered-icon']")
    item['Covered'] = node_covered.select(".//span[@class='detail-desc']/text()").extract()[0]
    node_heated = hxs.select("//ul[@class='detail-lister']/li[@class='heated-icon']")
    item['Heated'] = node_heated.select(".//span[@class='detail-desc']/text()").extract()[0]
    node_capacity = hxs.select("//ul[@class='detail-lister']/li[@class='capacity-icon last']")
    item['Capacity'] = node_capacity.select(".//span[@class='detail-desc']/text()").extract()[0]

    return [item,]
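
For the code above to run, PatioItem (imported from PatioDetail.items in the question) has to declare every field that parse() assigns. A minimal sketch of such an items.py, with the field names taken from the parse() methods on this page (the original file may of course define them differently):

from scrapy.item import Item, Field

class PatioItem(Item):
    # Fields assigned by the parse() methods above
    Name = Field()
    Type = Field()
    Covered = Field()
    Heated = Field()
    Capacity = Field()

With the fields in place, the combined spider can then be run with scrapy crawl PDSHeader as usual.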

Here is a tutorial on XPath; it should help you :)
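
One detail in the answer's code worth calling out is the leading .// in the sub-selects. A short sketch of the difference, reusing the same selectors as above:

# A leading "//" always searches from the document root, even when called
# on a sub-selector, while ".//" is relative to the node already selected.
node = hxs.select("//ul[@class='detail-lister']/li[@class='type-icon']")
everything = node.select("//span[@class='detail-desc']/text()").extract()     # every detail-desc on the page
only_this_li = node.select(".//span[@class='detail-desc']/text()").extract()  # just this <li>'s description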