Crawling a website with Scrapy

Time: 2017-09-04 09:52:20

Tags: python scrapy

Hello, I am using Scrapy to crawl news from a website, but I get an error when I run the spider. The site has many news pages, and each news URL looks like www.example.com/34223. I have been trying to find a solution. Here is my code; the Scrapy version is 1.4.0 and I am on macOS.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Example(CrawlSpider):  # rules only take effect on a CrawlSpider, not scrapy.Spider
    name = "example"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com"]

    rules = (
        # News pages look like www.example.com/34223, so extract links whose
        # path is a numeric ID and parse them with the spider's parse_item.
        # Rules are tried in order and only the first match applies, so this
        # rule must come before the catch-all rule below.
        Rule(LinkExtractor(allow=(r'/\d+$', )), callback='parse_item'),

        # Follow every other internal link
        # (since no callback means follow=True by default).
        Rule(LinkExtractor()),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # a bare scrapy.Item() declares no fields, so yield a plain dict instead
        yield {
            'title': response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[2]/text()').extract(),
            'img_url': response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[3]/img').extract(),
            'description': response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[5]/text()').extract(),
        }
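The other bug worth calling out separately: a bare scrapy.Item() declares no fields, so an assignment like item['title'] = ... raises a KeyError. If typed items are preferred over the plain dict used above, the fields have to be declared on an Item subclass first; a minimal sketch (the NewsItem name is illustrative):

import scrapy

class NewsItem(scrapy.Item):
    # every field must be declared before it can be assigned on an instance
    title = scrapy.Field()
    img_url = scrapy.Field()
    description = scrapy.Field()

parse_item can then build and yield NewsItem(title=..., img_url=..., description=...) instead of the dict.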

2 Answers:

Answer 0 (score: 0)

Thanks, this works now, but I need to scrape all of the site's news.

# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.Example.com']
    start_urls = ['http://www.Example.com/1621305']

    def parse(self, response):
        for article in response.css('.article'):
            yield {
                'title': article.css('.article-title h1::text').extract(),
                'time': article.css('.article-time time::text').extract(),
                'article': article.css('.article-text p::text').extract(),
            }
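Since the goal is every news article rather than the single hard-coded ID, the spider can also discover article links from the pages it visits and follow them. A minimal sketch, assuming article URLs end in a numeric ID like /1621305 (the spider name is illustrative; the CSS selectors are carried over from the answer above, and response.follow is available since Scrapy 1.4):

# -*- coding: utf-8 -*-
import scrapy

class AllNewsSpider(scrapy.Spider):
    name = 'all_news'
    allowed_domains = ['www.Example.com']
    start_urls = ['http://www.Example.com/']

    def parse(self, response):
        # follow every link whose path ends in a numeric article ID;
        # allowed_domains filters out off-site links automatically
        for href in response.css('a::attr(href)').re(r'.*/\d+$'):
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        for article in response.css('.article'):
            yield {
                'title': article.css('.article-title h1::text').extract(),
                'time': article.css('.article-time time::text').extract(),
                'article': article.css('.article-text p::text').extract(),
            }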

Answer 1 (score: 0)

I did fix the code and it works fine now. Yes, this is what I did:

# both keys and values are pre-encoded to bytes before being sent
credentials = {
    'user'.encode("utf-8"): 'user'.encode("utf-8"),
    'pass'.encode("utf-8"): 'pass'.encode("utf-8"),
}
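For context, a dict like this is typically passed as form data when logging in before crawling. A minimal sketch of how that might look, assuming the site has a login form whose fields are named user and pass (the spider name, login URL, and field names are all hypothetical; Scrapy encodes form fields to bytes itself, so the manual .encode("utf-8") is usually unnecessary):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    # hypothetical login page; replace with the site's real login URL
    start_urls = ['http://www.example.com/login']

    def parse(self, response):
        # from_response pre-fills the page's own login form and
        # overrides the fields listed in formdata
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'user': 'user', 'pass': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # the session cookies are set from here on, so news pages
        # can be requested as in the spiders above
        self.logger.info('Logged in, landed on %s', response.url)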