Question

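I want to scrape the TOP TEN RELEASES table from Torrenting.com. I wrote a scraper for this, but you have to log in to the site first. The data I scraped initially was basically empty, so I started rebuilding my torrent_spider.py for this purpose. I am new to web scraping, and this is where I got stuck.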

I have been reading the Scrapy documentation and found that start_requests() should help me connect to Torrenting and start scraping the table.

My question is: can someone explain how, after my spider has logged in, I can get back to the https://www.torrenting.com/browse.php page so that I can start scraping the data I want?

Here is torrent_spider.py:

import scrapy
from scrapy import Spider
from scrapy.selector import Selector


class TorrentSpider(Spider):
    """Spider that scrapes the TOP TEN RELEASES table."""
    name = "torrenting"
    allowed_domains = ["torrenting.com"]
    start_urls = [
        "https://www.torrenting.com/browse.php",
    ]

    def start_requests(self):
        # Scrapy calls start_requests() (plural); once it is overridden,
        # start_urls is no longer requested automatically.
        return [scrapy.FormRequest("https://www.torrenting.com/login.php?returnto=Login",
                                   formdata={'user': 'example', 'pass': 'somepass'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        pass


    def parse(self, response):
        pass

The correct way to use Scrapy's start_requests()

0 answers: