Python Scrapy CrawlSpider: rules not applied after login, only the first page is crawled

Asked: 2014-05-28 04:53:32

Tags: python scrapy scrapy-spider

I am a C/C++ programmer with limited Python experience, mostly in plotting and text processing. I am currently working on a personal data-analysis project, and I am using Scrapy to crawl all the threads and user information in a forum.

I have put together some initial code that is meant to log in first and then, starting from the index page of a sub-forum, do the following:

1) Extract all thread links whose URLs contain "topic".

2) For now, save each page to a file (item information will be extracted once the whole process works).

3) Find the next-page link marked with class="next", go to the next page, and repeat 1) and 2).

I know that for each thread I will still need to go through all the pages containing the reply posts, but I plan to do that once my current code works correctly (a rough sketch of what that might look like is included after my spider code below).

However, my current code only extracts all the threads on the starting URL and then stops. I have searched for several hours without finding a solution, so I am asking here in the hope that someone with Scrapy experience can help. If you would like any other information, such as the output, please let me know. Thanks!

Update regarding Paul's answer: I have updated my code; the problem was with my link extractor, which I still need to fix. Other than that, the rules now work correctly. Thanks again to Paul for the help.
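(As an aside, and only as a sketch: one possible direction for that fix would be to let a second Rule pick up the topic links directly instead of extracting them by hand in parse_links. The allow pattern below is just my assumption about what the forum's topic URLs contain and has not been verified against the site.)

rules = (
    # follow the "next" pagination links on the index pages
    Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
         follow=True),
    # assumption: topic URLs contain "topic"; with this rule the title would
    # have to be read inside parse_posts instead of being passed via meta
    Rule(SgmlLinkExtractor(allow=(r'topic',), unique=True),
         callback='parse_posts',
         follow=False),
)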

Here is my current spider code:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.selector import Selector

class ZhuaSpider(CrawlSpider):
    name = 'zhuaspider'
    allowed_domains = ['depressionforums.org']
    login_page = 'http://www.domain.com/forums/index.php?app=core&module=global&section=login'
    start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
                           callback='parse_links',
                           follow=True),
            )

    def start_requests(self):
        """called before crawling starts. Try to login"""
        yield Request(
                url=self.login_page,
                callback=self.login,
                dont_filter=True)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                formdata={'ips_username': 'myuid', 'ips_password': 'mypwd'},
                callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Username or password incorrect" in response.body:
            self.log("Login failed.")
        else:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin.
            for url in self.start_urls:
                # explicitly ask Scrapy to run the responses through rules
                yield Request(url, callback=self.parse)

    def parse_links(self, response):
        hxs = Selector(response)
        links = hxs.xpath('//a[contains(@href, "topic")]')
        for link in links:
            title = ''.join(link.xpath('./@title').extract())
            url = ''.join(link.xpath('./@href').extract())
            meta = {'title': title}
            yield Request(url, callback=self.parse_posts, meta=meta)

    #If I add this line it will only crawl the starting url,
    #otherwise it still won't apply the rule and crawls nothing.
    parse_start_url = parse_links

    def parse_posts(self, response):
        filename = 'download/'+ response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
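
As a rough sketch only (not part of the code above): one way parse_posts could later be extended to walk through a topic's reply pages, assuming those pages mark their pagination link with the same class="next" used on the forum index, which is an assumption I have not verified:

    def parse_posts(self, response):
        # save the page for now; item extraction will be added later
        # (the filename scheme may need adjusting so that pages of the same
        # topic do not overwrite each other)
        filename = 'download/' + response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
        # assumption: reply pages use the same class="next" pagination link as the index
        hxs = Selector(response)
        next_page = hxs.xpath('//li[@class="next"]/a/@href').extract()
        if next_page:
            yield Request(next_page[0], callback=self.parse_posts, meta=response.meta)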

1 Answer

Answer 0 (score: 2):

For a CrawlSpider's Rules to be applied, the Requests have to be processed by the spider's built-in parse() method.

You can achieve that by explicitly setting callback=self.parse, or by not setting a callback at all.

Then, with only that change, you should see the pages linked from //li[@class="next"] on the start_urls pages being fetched and parsed with parse_links()... except for the start_urls pages themselves.

To have the start_urls responses also go through parse_links(), you need to define the special parse_start_url attribute.

You could do it like this:

start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
         callback='parse_links',
         follow=True),
)

...

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are successfully logged in."""
    if "Username or password incorrect" in response.body:
        self.log("Login failed.")
    else:
        self.log("Successfully logged in. Let's start crawling!")
        # Now the crawling can begin.
        for url in self.start_urls:
            # explicitly ask Scrapy to run the responses through rules
            yield Request(url, callback=self.parse)

...

parse_start_url = parse_links