I'm new to Scrapy, and this is my second spider:
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule


    class SitenameScrapy(scrapy.Spider):
        name = "sitename"
        allowed_domains = ['www.sitename.com', 'sitename.com']
        # Note: `rules` is only honored by CrawlSpider; a plain scrapy.Spider ignores it.
        rules = [Rule(LinkExtractor(unique=True), follow=True)]

        def start_requests(self):
            urls = ['http://www.sitename.com/']
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse_cat)

        def parse_cat(self, response):
            # Extract every link on the page and route it by URL pattern.
            links = LinkExtractor().extract_links(response)
            for link in links:
                if '/category/' in link.url:
                    yield response.follow(link, self.parse_cat)
                if '/product/' in link.url:
                    yield response.follow(link, self.parse_prod)

        def parse_prod(self, response):
            pass
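My problem is that I sometimes get links like http://sitename.com/path1/path2/?param1=value1&param2=value2. The param1 parameter doesn't matter to me, and I want to remove it from the URL before calling response.follow. I think I could do this with a regex, but I'm not sure whether that is the "right way" in Scrapy. Should I perhaps use some kind of rule for this instead?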
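To make the goal concrete, here is a minimal sketch of the URL cleanup described above, using only Python's standard `urllib.parse` instead of a regex. The helper name `drop_query_params` and its `unwanted` argument are hypothetical, introduced purely for illustration; they are not part of Scrapy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_params(url, unwanted=("param1",)):
    """Return `url` with the query parameters named in `unwanted` removed.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    parts = urlsplit(url)
    # keep_blank_values=True preserves parameters that have an empty value
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in unwanted
    ]
    # Rebuild the URL with only the remaining parameters in the query string
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = drop_query_params(
    "http://sitename.com/path1/path2/?param1=value1&param2=value2"
)
print(cleaned)  # http://sitename.com/path1/path2/?param2=value2
```

Inside `parse_cat`, the cleaned string could then be passed on as `response.follow(drop_query_params(link.url), self.parse_cat)`, since `response.follow` also accepts a plain URL string. Note that `urlencode` re-encodes the remaining values, so unusual escaping in the original query may come out normalized. If the cleanup should happen at extraction time instead, `LinkExtractor` takes a `process_value` callable that may be a suitable hook, though whether it fits depends on the site.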