I have tried 3 different LinkExtractor variants, but it still ignores the 'deny' rules and crawls subdomains in all 3 variants... I want to exclude subdomains from the crawl.

Tried using only an 'allow' rule, allowing just the main domain, e.g. example.edu.uk:

rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # not working

Tried using only a 'deny' rule, denying all subdomains, i.e. sub.example.edu.uk:

rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # not working

Tried using both 'allow & deny' rules together:

rules = [Rule(LinkExtractor(allow=(r'^http:\/\/example\.edu\.uk(\/.*)?$'),deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # not working
Example:
Follow links on the main domain.
Discard links to subdomains.
Here is the full code:
from scrapy.item import Item, Field
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

class NewsFields(Item):
    pagetype = Field()
    pagetitle = Field()
    pageurl = Field()
    pagedate = Field()
    pagedescription = Field()
    bodytext = Field()

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['example.edu.uk']
    rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # Page metadata is stored in <meta name="nkd..."> tags
        ptype = soup.find_all(attrs={"name": "nkdpagetype"})
        ptitle = soup.find_all(attrs={"name": "nkdpagetitle"})
        pturl = soup.find_all(attrs={"name": "nkdpageurl"})
        ptdate = soup.find_all(attrs={"name": "nkdpagedate"})
        ptdesc = soup.find_all(attrs={"name": "nkdpagedescription"})
        for node in soup.find_all("div", id="main-content__wrapper"):
            # Collapse all text inside the main content wrapper into one string
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        # Manually follow links found in <p><a href> elements
        for url in hxs.xpath('//p/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)
Can someone help? Thanks.
Answer 0 (score: 1)
Your first two rules are wrong:

rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # not working
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # not working

allow and deny are matched against absolute URLs, not domain names. The following should work for you:
rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )
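Because allow and deny patterns are applied to the full absolute URL of each extracted link, you can sanity-check such a pattern outside Scrapy with plain re. A minimal sketch (the test URLs and the dot-escaped pattern below are illustrative, not taken verbatim from the rule above):

import re

# Same idea as the rule above, with the dots escaped explicitly.
allow_pattern = re.compile(r'^https?://example\.edu\.uk/.*')

test_urls = [
    'http://example.edu.uk/listing',      # main domain -> matches
    'http://sub.example.edu.uk/listing',  # subdomain   -> does not match
    'example.edu.uk/listing',             # bare domain -> LinkExtractor never sees this form
]

for url in test_urls:
    print(url, bool(allow_pattern.match(url)))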
EDIT-1
First, you should change

allowed_domains = ['example.edu.uk']

to

allowed_domains = ['www.example.edu.uk']

Second, your rule for extracting URLs should be
rules = (Rule(LinkExtractor(allow=(r'^https?://www.example.edu.uk/.*', ))), )
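Put together, here is a minimal sketch of the spider head with both changes applied (the dots in the regex are escaped here; everything else in the spider stays as in the question). Note that pinning allowed_domains to the exact www host matters because Scrapy's offsite filtering treats a bare domain entry as also covering its subdomains:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']

    # An entry like 'example.edu.uk' would also allow sub.example.edu.uk,
    # so list the exact host you want to stay on.
    allowed_domains = ['www.example.edu.uk']

    rules = (
        Rule(LinkExtractor(allow=(r'^https?://www\.example\.edu\.uk/.*',))),
    )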
Third, in this part of your code:
for url in hxs.xpath('//p/a/@href').extract():
yield Request(response.urljoin(url), callback=self.parse)
the rules will not be applied. Requests you yield yourself are not restricted by the rules: rules only generate new requests automatically from extracted links, and they do not stop you from yielding other links that the rule configuration would not allow. However, setting allowed_domains applies to both the rules and your own yields.
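If you want that last loop to enforce the restriction explicitly as well, rather than relying only on the offsite filter triggered by allowed_domains, you can check the host of each resolved URL before yielding. A minimal sketch, assuming www.example.edu.uk is the only host you want to crawl; the item-extraction code from the question is elided:

from urllib.parse import urlparse

from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['www.example.edu.uk']  # offsite filter covers rules and manual yields
    rules = (Rule(LinkExtractor(allow=(r'^https?://www\.example\.edu\.uk/.*',))),)

    def parse(self, response):
        # ... item extraction as in the question ...
        for url in response.xpath('//p/a/@href').extract():
            absolute = response.urljoin(url)
            # Explicit check: only follow links whose host is exactly the
            # allowed one, so sub.example.edu.uk links are never requested.
            if urlparse(absolute).netloc == 'www.example.edu.uk':
                yield Request(absolute, callback=self.parse)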