This is my spider. When I run it, nothing comes out. I don't know why it is blank. I think the problem is the yield and return, but I don't know how to change it.
import scrapy
from scrapy.http import Request
from scrapy import Selector
from CSDNBlog1.items import Csdnblog1Item

class CSDNBlogSpider(scrapy.Spider):
    name = 'CSDNBlog1'
    download_delay = 1
    allowed_domains = ['blog.csdn.net']
    starts_urls = ['http://blog.csdn.net/u012150179/article/details/117490171']

    def parse(self, response):
        sel = Selector(response)
        items = []
        item = Csdnblog1Item()
        aricle_url = str(response.url)
        article_name = sel.xpath('//div[@id="article_details"]/div/h1/span/a/text()').extract()
        item['article_name'] = [n.encode('utf-8') for n in article_name]
        item['article_url'] = article_url.encode('utf-8')
        yield item
        urls = sel.xpath('//li[@class="next_article"]/a/@href').extract()
        for url in urls:
            print(url)
            url = "http://blog.csdn.net" + url
            print(url)
            yield Request(url, callback=self.parse)
This is my spider's log; it does nothing.
2017-02-06 15:35:46 [scrapy] INFO: Spider opened
2017-02-06 15:35:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0
Answer 0 (score: 0)
In [1]: fetch('http://blog.csdn.net/u012150179/article/details/117490171')
2017-02-06 16:41:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://blog.csdn.net/error/404.html?from=http%3a%2f%2fblog.csdn.net%2fu012150179%2farticle%2fdetails%2f117490171> from <GET http://blog.csdn.net/u012150179/article/details/117490171>
2017-02-06 16:41:51 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://blog.csdn.net/error/404.html?from=http%3a%2f%2fblog.csdn.net%2fu012150179%2farticle%2fdetails%2f117490171> (referer: None)
Make sure your start_urls actually resolve: as the shell session above shows, your URL redirects to a 404 page.
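One way to sanity-check a start URL outside Scrapy is a plain stdlib request that follows redirects and inspects the final status code (a minimal sketch; the helper name `is_live` and the example URL are made up for illustration):

```python
import urllib.request
import urllib.error

def is_live(url, timeout=10):
    # urlopen follows redirects like Scrapy's redirect middleware,
    # so a 302 -> 404 chain like the one above ends up as an error here
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, ValueError):
        return False

print(is_live("http://example.com/"))
```

If this prints False for a URL in start_urls, the spider will crawl nothing from it no matter how parse() is written.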
Answer 1 (score: 0)
Your start_url no longer works, and your yield statement is in the wrong place. Try the following code:
import scrapy
from CSDNBlog1.items import Csdnblog1Item
from scrapy import Selector

class Csdnblog1Spider(scrapy.Spider):
    name = "CSDNBlog2"
    download_delay = 1
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["blog.csdn.net"]
    start_urls = ['http://blog.csdn.net/u012150179/article/details/37306629/']

    def parse(self, response):
        item = Csdnblog1Item()
        sel = Selector(response)
        item['project'] = self.settings.get('BOT_NAME')
        article_url = str(response.url)
        article_name = sel.xpath('//div[@id="article_details"]/div/h1/span/a/text()').extract()
        item['article_name'] = [n.encode('utf-8') for n in article_name]
        item['article_url'] = article_url.encode('utf-8')
        urls = sel.xpath('//li[@class="next_article"]/a/@href').extract()
        for url in urls:
            print(url)
            url = "http://blog.csdn.net" + url
            print(url)
            yield scrapy.Request(url, callback=self.parse)
        # yield, not return: parse() is a generator, and a bare
        # `return item` would end it without ever emitting the item
        yield item
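The yield-vs-return confusion the asker mentions can be seen with a plain generator, independent of Scrapy (a sketch; the function names and dict values are made up for illustration):

```python
def parse_with_return(pages):
    # Mirrors the broken structure: yield requests, then `return` the item.
    # In a generator, `return` just stops iteration, so the item is lost.
    for p in pages:
        yield {"request": p}
    return {"item": "lost"}  # never reaches the consumer

def parse_with_yield(pages):
    # Correct structure: the item is yielded like any other value.
    for p in pages:
        yield {"request": p}
    yield {"item": "kept"}

print(list(parse_with_return(["a"])))  # [{'request': 'a'}]
print(list(parse_with_yield(["a"])))   # [{'request': 'a'}, {'item': 'kept'}]
```

Scrapy iterates over whatever parse() produces, so anything you `return` instead of `yield` silently disappears, which is exactly the "blank" output in the question.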