Link extraction error

Date: 2016-03-14 18:51:46

Tags: python scrapy

My goal is to extract specific data from different links. For example, the target listing page is http://www.hurriyetemlak.com/satilik-daire, and I want to collect the price value from

http://www.hurriyetemlak.com/konut-satilik/istanbul-bahcelievler-bahcelievler-emlakcidan-apartman-dairesi/detay?sParam=T0CxxQ7yvMbCCAkDN0Behw==&new=1

or from other links such as http://www.hurriyetemlak.com/konut-satilik/ankara-cankaya-yasamkent-emlakcidan-apartman-dairesi/detay?sParam=iM12IpDxQ9JOLFTGIwQMKg==&new=1

My code looks like this:

import scrapy
from scrapy.spiders import CrawlSpider,Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class deneme01(CrawlSpider):
    name = 'hurriyetemlak'
    allowed_domains = ['hurriyetemlak.com']
    start_urls = ['http://www.hurriyetemlak.com/satilik-daire']
    Rule = (LinkExtractor(restrict_xpaths=('//ul[@id="reality-list"]//li[@onmouseover="show(this);"]')),callback='parse_item')


    def parse_item(self,response):
                            item = scrapy.Item()
                            item['price']=response.selector.xpath('//li[@class="price-lineclearfix"]/text()').extract()
                            yield item

But I get a syntax error, and I can't figure out why it's happening. I'm just trying to apply a Rule.

1 Answer:

Answer 0: (score: 0)

Please post the traceback when you get a syntax error.

To help other readers who may have a similar problem, I'll paste what scrapy runspider gives me for your spider (I omitted the callback):

$ cat test001.py 
import scrapy
from scrapy.spiders import CrawlSpider,Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class deneme01(CrawlSpider):
    name = 'hurriyetemlak'
    allowed_domains = ['hurriyetemlak.com']
    start_urls = ['http://www.hurriyetemlak.com/satilik-daire']
    Rule = (LinkExtractor(restrict_xpaths=('//ul[@id="reality-list"]//li[@onmouseover="show(this);"]')),callback='parse_item')

$ scrapy runspider test001.py 
2016-03-15 11:24:25 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:24:25 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:24:25 [scrapy] INFO: Overridden settings: {}
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy10/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/scrapy/commands/runspider.py", line 80, in run
    module = _import_file(filename)
  File "/home/paul/.virtualenvs/scrapy10/local/lib/python2.7/site-packages/scrapy/commands/runspider.py", line 20, in _import_file
    module = import_module(fname)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/paul/scrapinghub/scrapy/stackoverflow/35995711/test001.py", line 11
    Rule = (LinkExtractor(restrict_xpaths=('//ul[@id="reality-list"]//li[@onmouseover="show(this);"]')),callback='parse_item')
                                                                                                                ^
SyntaxError: invalid syntax
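As a side note, the `^` marker in the traceback points at `callback=`: a keyword argument is only valid inside a function call, so writing one inside a plain tuple literal is what triggers the SyntaxError. This can be reproduced without Scrapy at all, using only the standard library:

```python
import ast

# A keyword argument inside a tuple literal (not a call) is invalid syntax,
# just like the (LinkExtractor(...), callback='parse_item') tuple above.
snippet = "rules = (object(), callback='parse_item')"
try:
    ast.parse(snippet)
except SyntaxError as e:
    print("SyntaxError:", e.msg)
```

The exact error message varies across Python versions, but parsing always fails at the `callback=` token.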

You can use the scrapy docs as an example of the correct syntax.

The CrawlSpider page has this example:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

So you should:

  • have a rules attribute
  • make rules a list or tuple of Rule instances

Here is your spider with corrected syntax (I have not tested its behavior at runtime):

$ cat test002.py 
import scrapy
from scrapy.spiders import CrawlSpider,Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class deneme01(CrawlSpider):
    name = 'hurriyetemlak'
    allowed_domains = ['hurriyetemlak.com']
    start_urls = ['http://www.hurriyetemlak.com/satilik-daire']
    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//ul[@id="reality-list"]//li[@onmouseover="show(this);"]')),
             callback='parse_item'),
    )