Scrapy: LinkExtractor rules not working

Time: 2017-08-27 16:38:05

Tags: python scrapy

I have tried three different LinkExtractor variants, but the spider still ignores the 'deny' rules and crawls subdomains in all three cases... I want to exclude subdomains from the crawl.

Tried using only the 'allow' rule, allowing just the main domain, e.g. example.edu.uk:

rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # not working

Tried using only the 'deny' rule, denying all subdomains, i.e. sub.example.edu.uk:

rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # not working

Tried using the 'allow' and 'deny' rules together:

rules = [Rule(LinkExtractor(allow=(r'^http:\/\/example\.edu\.uk(\/.*)?$'),deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # not working

Example:

Follow these links:

  • example.edu.uk/fsdfs.htm
  • example.edu.uk/nkln.htm
  • example.edu.uk/vefr.htm
  • example.edu.uk/opji.htm

Drop the subdomain links (a quick pattern check against both lists follows below):

  • sub-domain.example.edu.uk/fsdfs.htm
  • sub-domain.example.edu.uk/nkln.htm
  • sub-domain.example.edu.uk/vefr.htm
  • sub-domain.example.edu.uk/opji.htm
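
For reference, here is a quick way to sanity-check a candidate pattern against both lists outside of Scrapy. This is only a sketch using plain re; the scheme and the "main domain only" pattern are my own assumptions, since LinkExtractor matches patterns against absolute URLs.

import re

# One URL from each list above, with a scheme added.
urls = [
    'http://example.edu.uk/fsdfs.htm',
    'http://sub-domain.example.edu.uk/fsdfs.htm',
]

pattern = re.compile(r'^https?://example\.edu\.uk/')  # hypothetical "main domain only" pattern
for url in urls:
    print(url, '->', 'follow' if pattern.search(url) else 'drop')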

Here is the full code...

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule


class NewsFields(Item):
    pagetype = Field()
    pagetitle = Field()
    pageurl = Field()
    pagedate = Field()
    pagedescription = Field()
    bodytext = Field()


class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['example.edu.uk']
    rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # Page metadata is read from the <meta name="nkd..."> tags.
        ptype = soup.find_all(attrs={"name":"nkdpagetype"})
        ptitle = soup.find_all(attrs={"name":"nkdpagetitle"})
        pturl = soup.find_all(attrs={"name":"nkdpageurl"})
        ptdate = soup.find_all(attrs={"name":"nkdpagedate"})
        ptdesc = soup.find_all(attrs={"name":"nkdpagedescription"})
        for node in soup.find_all("div", id="main-content__wrapper"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
            # Follow in-page links manually as well.
            for url in hxs.xpath('//p/a/@href').extract():
                yield Request(response.urljoin(url), callback=self.parse)

Can anyone help? Thanks.

1 Answer:

Answer 0 (score: 1)

Your first two rules are wrong:

rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # not working
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # not working

allow and deny are matched against absolute URLs, not domain names. The following should work for you:

rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )
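
If you also want an explicit deny pattern, it has to be written against the full URL in the same way. Below is a minimal sketch; the subdomain character class and the escaped host are my own assumptions, not part of the original answer.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Both allow and deny are matched against the absolute URL, so the deny
# pattern includes the scheme and host rather than a bare domain fragment.
rules = (
    Rule(LinkExtractor(
        allow=(r'^https?://example\.edu\.uk/.*',),
        deny=(r'^https?://[a-z0-9-]+\.example\.edu\.uk/',),  # any subdomain host
    )),
)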

Edit 1

First, you should change

allowed_domains = ['example.edu.uk']

to

allowed_domains = ['www.example.edu.uk']

since allowed_domains also admits subdomains of each listed domain: the bare domain still lets sub-domain.example.edu.uk through, while the exact host does not.

Second, the rule for extracting URLs should be

rules = (Rule(LinkExtractor(allow=(r'^https?://www.example.edu.uk/.*', ))), )
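
As a small aside, the unescaped dots in that pattern match any character; an escaped (functionally stricter, otherwise equivalent) variant of the same rule would be:

rules = (Rule(LinkExtractor(allow=(r'^https?://www\.example\.edu\.uk/.*', ))), )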

Third, in the following code of yours:

for url in hxs.xpath('//p/a/@href').extract():
    yield Request(response.urljoin(url), callback=self.parse)

the rules are not applied. Rules only govern the requests that the CrawlSpider generates automatically from extracted links; they do not stop you from yielding other links that the rule configuration would not allow. Setting allowed_domains, however, applies both to the rules and to the requests you yield yourself.
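
If you prefer not to rely on allowed_domains alone, you can also filter the manually yielded requests in the callback. This is only a minimal sketch; the helper name and the exact host are mine, not from the question.

from urllib.parse import urlparse

from scrapy.http import Request


def follow_onsite_links(response, allowed_host='www.example.edu.uk'):
    # Yield follow-up requests only for links whose host matches exactly,
    # mirroring what offsite filtering via allowed_domains would do.
    for href in response.xpath('//p/a/@href').extract():
        url = response.urljoin(href)
        if urlparse(url).netloc == allowed_host:
            yield Request(url)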