Is it possible to configure a Scrapy spider to ignore the URL parameters of already-visited URLs, so that www.example.com/page?p=value1 is not visited if www.example.com/page?p=value2 has already been visited?
Answer 0: (score: 4)
You cannot configure this directly, but according to the documentation you can subclass the standard duplicate filter class and override its request_fingerprint method.

This is untested, but it should work. First, subclass the standard duplicate filter class (e.g. in dupefilters.py):
from w3lib.url import url_query_cleaner
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class MyRFPDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # Strip the query string before fingerprinting, so that URLs
        # differing only in their parameters get the same fingerprint
        # and are treated as duplicates.
        new_request = request.replace(url=url_query_cleaner(request.url))
        return request_fingerprint(new_request)
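As an illustration (not part of the original answer), here is a quick check of what url_query_cleaner does with the two URLs from the question; with its default arguments it strips the whole query string, which is why both URLs end up with the same fingerprint:

from w3lib.url import url_query_cleaner

# Both URLs reduce to the same parameter-free URL, so the filter
# above computes identical fingerprints and drops the second request.
print(url_query_cleaner('http://www.example.com/page?p=value1'))
# http://www.example.com/page
print(url_query_cleaner('http://www.example.com/page?p=value2'))
# http://www.example.com/page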
Then point DUPEFILTER_CLASS in settings.py to your class:
DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter'
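If you want to ignore only some parameters rather than all of them, url_query_cleaner also accepts a parameterlist argument naming the parameters to keep. A hedged sketch of such a variant (the class name KeepSomeParamsDupeFilter and the 'lang' parameter are purely illustrative, not from the original answer):

from w3lib.url import url_query_cleaner
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class KeepSomeParamsDupeFilter(RFPDupeFilter):
    # Parameters listed here still count towards the fingerprint;
    # every other query parameter is ignored.
    KEEP_PARAMS = ['lang']

    def request_fingerprint(self, request):
        cleaned = url_query_cleaner(request.url, parameterlist=self.KEEP_PARAMS)
        return request_fingerprint(request.replace(url=cleaned))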