I am a C/C++ programmer with limited Python experience (plotting and text processing). I'm currently working on a personal data-analysis project, using Scrapy to crawl all the threads and user information from a forum.
I have put together some initial code that first logs in and then, starting from a sub-forum's index page, does the following:
1) extract all thread links whose href contains "topic"
2) save each page to a file for now (item extraction will be added once the whole process works)
3) find the next-page link tagged class="next", go to the next page, and repeat 1) and 2)
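For what it's worth, the intended flow boils down to a simple loop. Below is a hypothetical sketch (the helper names and the tiny in-memory "site" are made up for illustration; Scrapy is not involved):

```python
# Hypothetical sketch of the crawl plan above; fetch/topic_links/next_link/save
# are stand-ins for whatever Scrapy pieces do the real work.
def crawl_forum(index_url, fetch, topic_links, next_link, save):
    page = index_url
    while page:
        html = fetch(page)
        for link in topic_links(html):   # 1) thread links containing "topic"
            save(link, fetch(link))      # 2) save the page for now
        page = next_link(html)           # 3) follow the class="next" link

# Tiny in-memory stand-in for the forum:
site = {
    'index1': {'topics': ['topic-a', 'topic-b'], 'next': 'index2'},
    'index2': {'topics': ['topic-c'], 'next': None},
    'topic-a': 'body-a', 'topic-b': 'body-b', 'topic-c': 'body-c',
}
saved = {}
crawl_forum('index1', site.get,
            lambda h: h['topics'], lambda h: h['next'],
            lambda url, body: saved.__setitem__(url, body))
print(sorted(saved))  # ['topic-a', 'topic-b', 'topic-c']
```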
I know that for each thread I will still need to walk through all the pages containing the reply posts, but I plan to do that once my current code works correctly.
However, my current code only extracts all the threads on the starting URL and then stops. I have searched for hours without finding a solution, so I'm asking here in the hope that someone with Scrapy experience can help. If you need any other information, such as the output, please let me know. Thanks!
Update regarding Paul's answer: I've updated my code. Something is wrong with my link extractor and I still need to fix it; apart from that, the rules now work correctly. Thanks again for the help, Paul.
Here is my current spider code:
    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.http import Request, FormRequest
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import Rule
    from scrapy.selector import Selector

    class ZhuaSpider(CrawlSpider):
        name = 'zhuaspider'
        allowed_domains = ['depressionforums.org']
        login_page = 'http://www.domain.com/forums/index.php?app=core&module=global&section=login'
        start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
                 callback='parse_links',
                 follow=True),
        )

        def start_requests(self):
            """Called before crawling starts. Try to log in."""
            yield Request(
                url=self.login_page,
                callback=self.login,
                dont_filter=True)

        def login(self, response):
            """Generate a login request."""
            return FormRequest.from_response(response,
                formdata={'ips_username': 'myuid', 'ips_password': 'mypwd'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            """Check the response returned by a login request to see if we are successfully logged in."""
            if "Username or password incorrect" in response.body:
                self.log("Login failed.")
            else:
                self.log("Successfully logged in. Let's start crawling!")
                # Now the crawling can begin.
                for url in self.start_urls:
                    # explicitly ask Scrapy to run the responses through rules
                    yield Request(url, callback=self.parse)

        def parse_links(self, response):
            hxs = Selector(response)
            links = hxs.xpath('//a[contains(@href, "topic")]')
            for link in links:
                title = ''.join(link.xpath('./@title').extract())
                url = ''.join(link.xpath('./@href').extract())
                meta = {'title': title}
                yield Request(url, callback=self.parse_posts, meta=meta)

        # If I add this line it will only crawl the starting url,
        # otherwise it still won't apply the rule and crawls nothing.
        parse_start_url = parse_links

        def parse_posts(self, response):
            filename = 'download/' + response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)
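As an aside, the filename logic in parse_posts just takes the second-to-last path segment of the URL, i.e. the thread slug when the URL ends with a trailing slash. A standalone illustration of that slicing (the example thread slug is made up, no Scrapy needed):

```python
# Mimic the filename derivation used in parse_posts:
# url.split("/")[-2] picks the second-to-last path segment.
def filename_for(url):
    return 'download/' + url.split("/")[-2]

# Hypothetical thread URL on this forum:
url = 'http://www.depressionforums.org/forums/topic/12345-some-thread/'
print(filename_for(url))  # download/12345-some-thread
```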
Answer 0 (score: 2)
For Requests to go through the Rules of a CrawlSpider, they need to be handled by the spider's internal parse() method. You can achieve this by explicitly setting callback=self.parse, or by not setting a callback at all.
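As a rough plain-Python analogy of why the callback matters (this is a toy model, not Scrapy's actual internals): a CrawlSpider only applies its rules on the default parse() path, so a Request routed to a custom callback bypasses rule extraction entirely:

```python
# Toy analogy (NOT real Scrapy internals): rules only fire when a
# response goes through the default parse() entry point.
class ToyCrawlSpider:
    def __init__(self):
        self.extracted = []

    def parse(self, response):
        # default entry point: run the rule's link extractor
        for link in response.get('next_links', []):
            self.extracted.append(link)

    def parse_links(self, response):
        # custom callback: rules are NOT applied here
        pass

    def handle(self, response, callback=None):
        # no callback -> default parse() -> rules fire
        (callback or self.parse)(response)

spider = ToyCrawlSpider()
spider.handle({'next_links': ['page2']})                      # rules applied
spider.handle({'next_links': ['page3']}, spider.parse_links)  # rules skipped
print(spider.extracted)  # ['page2']
```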
    start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
             callback='parse_links',
             follow=True),
    )

    ...

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Username or password incorrect" in response.body:
            self.log("Login failed.")
        else:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin.
            for url in self.start_urls:
                # explicitly ask Scrapy to run the responses through rules
                yield Request(url, callback=self.parse)

Then, with just this, you should see the pages linked from the //li[@class="next"] sections being fetched and parsed with parse_links()... except the start_urls themselves.
To have the start_urls go through parse_links() too, you have to define a special parse_start_url attribute. You can do it like this:

    parse_start_url = parse_links
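To illustrate what that one-line alias does (a plain-Python sketch with a made-up Demo class, nothing Scrapy-specific): a class-level assignment simply makes both names refer to the same function, so responses for the start_urls get the exact same treatment as the rule-extracted pages:

```python
# Plain-Python illustration: a class-level alias makes two method names
# share one underlying function object.
class Demo:
    def parse_links(self, response):
        return 'parsed: ' + response

    # same trick as in the spider: start_urls responses reuse parse_links
    parse_start_url = parse_links

d = Demo()
print(d.parse_start_url('index-page'))           # parsed: index-page
print(Demo.parse_start_url is Demo.parse_links)  # True
```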