I am a C/C++ programmer with limited Python experience (plotting and text processing). I'm currently working on a personal data-analysis project, using Scrapy to crawl all the threads and user information from a forum.
I have put together some initial code that first logs in and then, starting from a sub-forum's index page, does the following:
1) extract all thread links whose href contains "topic"
2) save each page to a file for now (item extraction will be added once the whole process works)
3) find the next-page link tagged class="next", go to the next page, and repeat 1) and 2)
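For what it's worth, the intended flow boils down to a simple loop. Below is a hypothetical sketch (the helper names and the tiny in-memory "site" are made up for illustration; Scrapy is not involved):

```python
# Hypothetical sketch of the crawl plan above; fetch/topic_links/next_link/save
# are stand-ins for whatever Scrapy pieces do the real work.
def crawl_forum(index_url, fetch, topic_links, next_link, save):
    page = index_url
    while page:
        html = fetch(page)
        for link in topic_links(html):   # 1) thread links containing "topic"
            save(link, fetch(link))      # 2) save the page for now
        page = next_link(html)           # 3) follow the class="next" link

# Tiny in-memory stand-in for the forum:
site = {
    'index1': {'topics': ['topic-a', 'topic-b'], 'next': 'index2'},
    'index2': {'topics': ['topic-c'], 'next': None},
    'topic-a': 'body-a', 'topic-b': 'body-b', 'topic-c': 'body-c',
}
saved = {}
crawl_forum('index1', site.get,
            lambda h: h['topics'], lambda h: h['next'],
            lambda url, body: saved.__setitem__(url, body))
print(sorted(saved))  # ['topic-a', 'topic-b', 'topic-c']
```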
I know that for each thread I will still need to walk through all the pages containing the reply posts, but I plan to do that once my current code works correctly.
However, my current code only extracts all the threads on the starting URL and then stops. I have searched for hours without finding a solution, so I'm asking here in the hope that someone with Scrapy experience can help. If you need any other information, such as the output, please let me know. Thanks!
Update regarding Paul's answer: I've updated my code. Something is wrong with my link extractor and I still need to fix it; apart from that, the rules now work correctly. Thanks again for the help, Paul.
Here is my current spider code:
    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.http import Request, FormRequest
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import Rule
    from scrapy.selector import Selector

    class ZhuaSpider(CrawlSpider):
        name = 'zhuaspider'
        allowed_domains = ['depressionforums.org']
        login_page = 'http://www.domain.com/forums/index.php?app=core&module=global&section=login'
        start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
                 callback='parse_links',
                 follow=True),
        )

        def start_requests(self):
            """Called before crawling starts. Try to log in."""
            yield Request(
                url=self.login_page,
                callback=self.login,
                dont_filter=True)

        def login(self, response):
            """Generate a login request."""
            return FormRequest.from_response(response,
                formdata={'ips_username': 'myuid', 'ips_password': 'mypwd'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            """Check the response returned by a login request to see if we are successfully logged in."""
            if "Username or password incorrect" in response.body:
                self.log("Login failed.")
            else:
                self.log("Successfully logged in. Let's start crawling!")
                # Now the crawling can begin.
                for url in self.start_urls:
                    # explicitly ask Scrapy to run the responses through rules
                    yield Request(url, callback=self.parse)

        def parse_links(self, response):
            hxs = Selector(response)
            links = hxs.xpath('//a[contains(@href, "topic")]')
            for link in links:
                title = ''.join(link.xpath('./@title').extract())
                url = ''.join(link.xpath('./@href').extract())
                meta = {'title': title}
                yield Request(url, callback=self.parse_posts, meta=meta)

        # If I add this line it will only crawl the starting url,
        # otherwise it still won't apply the rule and crawls nothing.
        parse_start_url = parse_links

        def parse_posts(self, response):
            filename = 'download/' + response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)
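As an aside, the filename logic in parse_posts just takes the second-to-last path segment of the URL, i.e. the thread slug when the URL ends with a trailing slash. A standalone illustration of that slicing (the example thread slug is made up, no Scrapy needed):

```python
# Mimic the filename derivation used in parse_posts:
# url.split("/")[-2] picks the second-to-last path segment.
def filename_for(url):
    return 'download/' + url.split("/")[-2]

# Hypothetical thread URL on this forum:
url = 'http://www.depressionforums.org/forums/topic/12345-some-thread/'
print(filename_for(url))  # download/12345-some-thread
```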
Answer 0 (score: 2)
For Requests to go through the Rules of a CrawlSpider, they need to be handled by the spider's internal parse() method. You can achieve this by explicitly setting callback=self.parse, or by not setting a callback at all.
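As a rough plain-Python analogy of why the callback matters (this is a toy model, not Scrapy's actual internals): a CrawlSpider only applies its rules on the default parse() path, so a Request routed to a custom callback bypasses rule extraction entirely:

```python
# Toy analogy (NOT real Scrapy internals): rules only fire when a
# response goes through the default parse() entry point.
class ToyCrawlSpider:
    def __init__(self):
        self.extracted = []

    def parse(self, response):
        # default entry point: run the rule's link extractor
        for link in response.get('next_links', []):
            self.extracted.append(link)

    def parse_links(self, response):
        # custom callback: rules are NOT applied here
        pass

    def handle(self, response, callback=None):
        # no callback -> default parse() -> rules fire
        (callback or self.parse)(response)

spider = ToyCrawlSpider()
spider.handle({'next_links': ['page2']})                      # rules applied
spider.handle({'next_links': ['page3']}, spider.parse_links)  # rules skipped
print(spider.extracted)  # ['page2']
```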
    start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
             callback='parse_links',
             follow=True),
    )

    ...

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Username or password incorrect" in response.body:
            self.log("Login failed.")
        else:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin.
            for url in self.start_urls:
                # explicitly ask Scrapy to run the responses through rules
                yield Request(url, callback=self.parse)

Then, with just this, you should see the pages linked from the //li[@class="next"] sections being fetched and parsed with parse_links()... except the start_urls themselves.
To have the start_urls go through parse_links() too, you have to define a special parse_start_url attribute. You can do it like this:

    parse_start_url = parse_links
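To illustrate what that one-line alias does (a plain-Python sketch with a made-up Demo class, nothing Scrapy-specific): a class-level assignment simply makes both names refer to the same function, so responses for the start_urls get the exact same treatment as the rule-extracted pages:

```python
# Plain-Python illustration: a class-level alias makes two method names
# share one underlying function object.
class Demo:
    def parse_links(self, response):
        return 'parsed: ' + response

    # same trick as in the spider: start_urls responses reuse parse_links
    parse_start_url = parse_links

d = Demo()
print(d.parse_start_url('index-page'))           # parsed: index-page
print(Demo.parse_start_url is Demo.parse_links)  # True
```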