I have a scraper that collects the information perfectly, but I've hit a wall trying to implement rules to crawl the "next page" links. I'm using Scrapy 0.22 (I can't upgrade at this time).
import re
import datetime
import dateutil
import urllib2

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from crawlers.spiders import BaseCrawler


class rappSpider(BaseCrawler):
    name = "rapp"

    base_url = "www.example.com"

    start_urls = [
        # "http://www.example.com/news-perspective",
        # "http://www.example.com/news-perspective?f[0]=field_related_topics%3A31366",
        "http://www.example/news-perspective?key=&page=%d"
    ]

    # rules = [
    #     Rule(SgmlLinkExtractor(allow=r'?key=&page=[0-9]'), callback='get_article_links', follow=True)
    # ]

    TITLE_XPATH_SELECTOR = "//div[@id='inset-content']//h1/text()"
    TEXT_XPATH_SELECTOR = "//div[@class='field-item even']/p/text()"
    DATETIME_XPATH_SELECTOR = "//div[@class='field-items']/div/span/text()"

    def get_article_links(self, response, *args, **kwargs):
        html = Selector(response)
        link_extractor = SgmlLinkExtractor(allow=('http://www.example.com/news-perspective/\d{4}/\d{2}\/*\S*$',))
        is_relative_path = False
        yield [link.url for link in link_extractor.extract_links(response)], is_relative_path
The scraper works for start_urls like http://www.example/news-perspective, which lists a number of articles on the page; the scraper then follows the links defined in get_article_links and extracts the relevant information. However, I'd like it to also advance to the next page (the format is the same on the other pages, with URLs of the form
http://www.example/news-perspective?key=&page=#
where # is the page number).
How can I set this up with my existing code? Do I need two separate rules? Or do I need to change start_requests instead?
Answer 0 (score: 0)
On the site there is probably a "next" button that links to the following page. You should include a rule that matches that link.
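For reference, a minimal sketch of what that could look like with the Scrapy 0.22 contrib API used in the question (untested against the real site; it assumes the pagination links match ?key=&page=N as described above, and parse_article is a hypothetical callback standing in for your extraction logic). Note that ? must be escaped in the allow pattern, since allow takes regular expressions and ? is a metacharacter; the commented-out rule in the question is invalid for exactly that reason.

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class RappSpider(CrawlSpider):
    name = "rapp"
    start_urls = ["http://www.example.com/news-perspective"]

    rules = (
        # Follow the pagination links; no callback, just keep crawling.
        # `\?` is escaped because `allow` patterns are regular expressions.
        Rule(SgmlLinkExtractor(allow=(r'\?key=&page=\d+',)), follow=True),
        # Hand each article page to the callback (hypothetical name).
        Rule(SgmlLinkExtractor(allow=(r'/news-perspective/\d{4}/\d{2}/\S*$',)),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # Extract title/text/date here, e.g. with the XPath selectors
        # from the question via Selector(response).xpath(...).
        pass

One caveat: CrawlSpider dispatches its rules from the built-in parse method, so the spider must not override parse itself; if BaseCrawler does, the rules will never run. Alternatively, if the number of pages is known in advance, you could generate the paginated URLs yourself in start_requests instead of adding a second rule, but following the "next" link with a rule is simpler and requires no page count.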