Trying to scrape data from the website dicksmith.com.au

Time: 2015-05-21 10:29:48

Tags: python web-scraping scrapy

Python 2.7.6, Scrapy 0.24.6, website: dicksmith.com.au, OS: Linux (Ubuntu). URL (the mobile site is simple): http://search.dicksmith.com.au/search?w=mobile+phone&ts=m

Sorry, guys, I am new to Scrapy. Thanks in advance.

Code:

import scrapy

class PriceWatchItem( scrapy.Item ):
    name = scrapy.Field()
    price = scrapy.Field()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class PriceWatchSpider( CrawlSpider ):
    name = 'dicksmith'
    allowed_domains = ['dicksmith.com.au']
    start_urls = ['http://search.dicksmith.com.au/search']
    rules = [ Rule ( LinkExtractor( allow = ['?w=mobile+phone&ts=m']      ), 'parse_dickSmith' ) ]

    def parse_dickSmith( self, response ):
        dickSmith = PriceWatchItem()
        dickSmith['name'] = response.xpath("//h1/text()").extract()
        return dickSmith
# Run with: scrapy crawl dicksmith -o scraped_data.json

ERROR:

File "pricewatch.py", line 10, in <module>
    class PriceWatchSpider( CrawlSpider ):
  File "pricewatch.py", line 14, in PriceWatchSpider
    rules = [ Rule ( LinkExtractor( allow = ['?w=mobile+phone&ts=m'] ), 'parse_dickSmith' ) ]
  File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 94, in __init__
    deny_extensions)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractor.py", line 46, in __init__
    self.allow_res = [x if isinstance(x, _re_type) else re.compile(x) for x in arg_to_iter(allow)]
  File "/usr/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat

1 answer:

Answer 0 (score: 0)

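You need to escape the ? and + characters. LinkExtractor compiles each allow entry as a regular expression, and in a regex both ? and + are quantifiers; a pattern that starts with ? has nothing before it to repeat, which is what raises "nothing to repeat".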

Try this:

import re
# Escape ? and + so they match literally instead of acting as quantifiers
reg = re.compile(r'\?w=mobile\+phone&ts=m')
rules = [ Rule ( LinkExtractor( allow = reg ), 'parse_dickSmith' ) ]
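
The escaping issue can be checked without Scrapy at all. The following is a minimal sketch that uses only the standard `re` module; the names `URL`, `QUERY`, and `compiles` are hypothetical and introduced just for illustration. It shows that the raw query string is rejected as a pattern, and that `re.escape` is one way to build the escaped pattern instead of adding the backslashes by hand.

```python
import re

# URL from the question, and the query string that was passed to `allow`
URL = "http://search.dicksmith.com.au/search?w=mobile+phone&ts=m"
QUERY = "?w=mobile+phone&ts=m"


def compiles(pattern):
    """Return True if `pattern` is a valid regular expression (illustrative helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The raw query string is not a valid regex: the leading "?" is a
# quantifier with nothing in front of it to repeat.
raw_ok = compiles(QUERY)

# re.escape backslash-escapes regex metacharacters so they match literally.
escaped = re.compile(re.escape(QUERY))
match = escaped.search(URL)

# The hand-escaped pattern from the answer should behave the same way.
manual = re.compile(r"\?w=mobile\+phone&ts=m")
manual_match = manual.search(URL)

print(raw_ok)
print(match.group(0))
print(manual_match.group(0))
```

Either compiled pattern could then be passed as the `allow` argument, assuming the rest of the spider is unchanged.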