What is wrong with my scraper?

Date: 2015-04-04 14:30:20

Tags: python web-crawler scrapy scrapy-spider

I want to scrape the contact details for each agent_name by following the link from the listing page to the agent's page. Sometimes this script returns one entry, sometimes a different one, and I can't figure out why.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()


class criticspider(CrawlSpider):
    name = "comp"
    allowed_domains = ["iproperty.com.my"]
    start_urls = ["http://www.iproperty.com.my/property/searchresult.aspx?t=S&gpt=AR&st=&ct=&k=&pt=&mp=&xp=&mbr=&xbr=&mbu=&xbu=&lo=&wp=&wv=&wa=&ht=&au=&sby=&ns=1"]


    def parse(self, response):
        sites = response.xpath('.//*[@id="frmSaveListing"]/ul')
        items = []

        for site in sites:
            item = CompItem()
            item['title'] = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/text()').extract()[0]
            item['link'] = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//*[@id="main-content3"]/div[1]/div/table/tbody/tr/td[1]/table/tbody/tr[3]/td/text()').extract()
        yield old_item

1 Answer:

Answer 0 (score: 0)

Even if you open the start URL in a browser and refresh the page several times, you get different search results each time — so part of what you are seeing is the site itself, not your code.

In any case, your spider needs adjusting, because it does not extract all of the agents from the page:

import scrapy
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()


class criticspider(scrapy.Spider):
    name = "comp"

    allowed_domains = ["iproperty.com.my"]
    start_urls = ["http://www.iproperty.com.my/property/searchresult.aspx?t=S&gpt=AR&st=&ct=&k=&pt=&mp=&xp=&mbr=&xbr=&mbu=&xbu=&lo=&wp=&wv=&wa=&ht=&au=&sby=&ns=1"]


    def parse(self, response):
        agents = response.xpath('//li[@class="search-listing"]//div[@class="article-right"]')
        for agent in agents:
            item = CompItem()
            item['title'] = agent.xpath('.//a/text()').extract()[0]
            item['link'] = agent.xpath('.//a/@href').extract()[0]
            yield scrapy.Request(urljoin("http://www.iproperty.com.my", item['link']),
                                 meta={'item': item},
                                 callback=self.anchor_page)


    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//*[@id="main-content3"]//table//table//p/text()').extract()
        yield old_item

What I've fixed:

  • Used scrapy.Spider instead of CrawlSpider
  • Fixed the XPath expressions so they iterate over every agent on the page, follow each link, and grab the agent's self-description/promotional text
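One more detail worth noting: the answer's spider drops the original `if 'http://' not in item['link']` check and calls `urljoin` unconditionally, which is safe because `urljoin` resolves relative links against the base and returns absolute links unchanged. A minimal sketch (Python 3's `urllib.parse`; on Python 2, as in the spiders above, the same function comes from `urlparse` — the `/property/12345` path is just an illustrative example, not a real listing):

```python
# urljoin handles both relative and absolute links, so a manual
# "is this already absolute?" check before joining is unnecessary.
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.iproperty.com.my/property/searchresult.aspx"

# A relative link is resolved against the base URL...
print(urljoin(base, "/property/12345"))
# ...while an already-absolute link is returned unchanged.
print(urljoin(base, "http://www.iproperty.com.my/property/12345"))
```

Both calls print `http://www.iproperty.com.my/property/12345`, which is why the unconditional `urljoin` in the answer behaves the same for either kind of `href`.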