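I've rewritten this question to make clearer what help I'm looking for. I'm trying to scrape search results pages like this one:
http://search.people.com.cn/cnpeople/search.do?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0
But when I run the spider in Scrapy, the request appears to be redirected: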
2020-01-10 09:55:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> from <GET http://search.people.com.cn/cnpeople/search.do?pageNum=7&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0>
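After that, nothing gets scraped.
Is this redirect simply how the site routes me to the results list, or is the site trying to stop me from scraping the results? Is there anything I can do about it?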
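To tell those two cases apart, one check I can think of is to request a page without following the redirect and look at the status code and the `Location` header. Below is a minimal sketch that uses only the standard library, outside Scrapy. `NoRedirect` and `probe_redirect` are names I made up for illustration, and I have not confirmed how the site responds to this.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request,
        # so the 3xx response surfaces as an HTTPError instead.
        return None


def probe_redirect(url, timeout=10):
    """Return (status_code, Location header or None) without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        # Unfollowed redirects (and other HTTP errors) end up here.
        return err.code, err.headers.get("Location")
```

Calling `probe_redirect(url)` on one of the search URLs should show whether the server answers with a 302 and where it points. Within Scrapy, the `dont_redirect` request meta key serves a similar purpose.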
Here is my spider code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "RMW"

    def start_requests(self):
        # starturls = ['http://search.people.com.cn/cnpeople/search.do?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0',]
        # Request result pages 1 through 9 of the search.
        for num in range(1, 10):
            url = 'http://search.people.com.cn/cnpeople/search.do?pageNum=' + str(num) + '&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0'
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Yield the first link found inside each <ul> element.
        for link in response.css("ul"):
            yield {
                'link': link.css("a::attr(href)").get()
            }
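As a side note, the page URLs in the spider could also be built from a parameter dictionary instead of string concatenation. The sketch below is only an illustration: `BASE_URL` and `build_search_url` are hypothetical names, and the `gbk` encoding is an assumption, based on the fact that the escaped keyword `%C8%F0%B5%E4` decodes to 瑞典 (Sweden) when read as GBK.

```python
from urllib.parse import urlencode

BASE_URL = "http://search.people.com.cn/cnpeople/search.do"


def build_search_url(page, keyword="瑞典"):
    """Build the search URL for one result page (hypothetical helper)."""
    params = {
        "pageNum": page,
        "keyword": keyword,
        "siteName": "news",
        "facetFlag": "true",
        "nodeType": "belongsId",
        "nodeId": 0,
    }
    # Assumption: the site expects GBK percent-encoding for the keyword,
    # as suggested by the escaped bytes in the original URL.
    return BASE_URL + "?" + urlencode(params, encoding="gbk")


# URLs for pages 1 through 9, matching the range used in the spider.
urls = [build_search_url(n) for n in range(1, 10)]
```

This produces the same query strings as the hand-written URLs, while keeping the keyword readable and easy to change.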
I would really appreciate a solution from anyone with more expertise in this area.