I am trying to scrape data from a website that requires authentication. I have been able to log in successfully using requests and HttpNtlmAuth with the following:
s = requests.session()
url = "https://website.com/things"
response = s.get(url, auth=HttpNtlmAuth('DOMAIN\\USERNAME','PASSWORD'))
I would like to explore Scrapy's capabilities, but I have not been able to authenticate successfully.
I came across the following middleware, which looks like it should work, but I don't think I am implementing it correctly:
https://github.com/reimund/ntlm-middleware/blob/master/ntlmauth.py
In my settings.py I have
SPIDER_MIDDLEWARES = { 'test.ntlmauth.NtlmAuthMiddleware': 400, }
and in my spider class I have
http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
I cannot get this to work.
If anyone who has successfully scraped a site using NTLM authentication could point me in the right direction, I would appreciate it.
Answer 0 (score: 5)
I was able to figure out what was going on.
1: This is considered a "DOWNLOADER_MIDDLEWARE", not a "SPIDER_MIDDLEWARE".
DOWNLOADER_MIDDLEWARES = { 'test.ntlmauth.NTLM_Middleware': 400, }
2: The middleware I was trying to use needed significant modification. Here is what works for me:
from scrapy.http import Response
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):

    def process_request(self, request, spider):
        url = request.url
        pwd = getattr(spider, 'http_pass', '')
        usr = getattr(spider, 'http_user', '')
        # Fetch the page with requests + NTLM and hand the result back to Scrapy,
        # bypassing Scrapy's own downloader for this request.
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        return Response(url, response.status_code, {}, response.content)
Within the spider, all you need to do is set these variables:
http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
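For reference, a minimal spider using this middleware might look like the sketch below. The spider name and start URL are placeholders reusing the question's example site; the only parts the middleware actually relies on are the http_user and http_pass attributes, which it reads with getattr.

import scrapy

class ThingsSpider(scrapy.Spider):
    # hypothetical name and URL, reusing the question's example site
    name = "things"
    start_urls = ["https://website.com/things"]

    # picked up by NTLM_Middleware via getattr(spider, 'http_user') / getattr(spider, 'http_pass')
    http_user = 'DOMAIN\\USER'
    http_pass = 'PASS'

    def parse(self, response):
        # the response body is whatever the requests call in the middleware returned
        yield {"url": response.url, "size": len(response.body)}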
Answer 1 (score: 2)
Thanks to @SpaceDog's comment above: I ran into a similar problem trying to crawl an intranet website using NTLM authentication. The crawler would only see the first page, because the LinkExtractor inside the CrawlSpider never fired.
Here is my working solution using scrapy 1.0.5
NTLM_Middleware.py
from scrapy.http import Response, HtmlResponse
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):

    def process_request(self, request, spider):
        url = request.url
        usr = getattr(spider, 'http_usr', '')
        pwd = getattr(spider, 'http_pass', '')
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        # Returning an HtmlResponse (with headers) lets the CrawlSpider's
        # LinkExtractor parse the page and follow links.
        return HtmlResponse(url, response.status_code, response.headers.iteritems(), response.content)
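Note that response.headers.iteritems() only exists on Python 2. On Python 3 (and a current Scrapy), the same middleware should work with items() instead; a sketch, otherwise identical to the code above:

from scrapy.http import HtmlResponse
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):

    def process_request(self, request, spider):
        url = request.url
        usr = getattr(spider, 'http_usr', '')
        pwd = getattr(spider, 'http_pass', '')
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        # items() works on both Python 2 and 3; iteritems() is Python 2 only
        return HtmlResponse(url, response.status_code,
                            dict(response.headers.items()), response.content)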
settings.py
import logging
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'scrapy intranet'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS=16
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'intranet.NTLM_Middleware.NTLM_Middleware': 200,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': None
}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline',
}
ELASTICSEARCH_SERVER='localhost'
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_USERNAME=''
ELASTICSEARCH_PASSWORD=''
ELASTICSEARCH_INDEX='intranet'
ELASTICSEARCH_TYPE='pages_intranet'
ELASTICSEARCH_UNIQ_KEY='url'
ELASTICSEARCH_LOG_LEVEL=logging.DEBUG
spiders/intranetspider.py
# -*- coding: utf-8 -*-
import scrapy
#from scrapy import log
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.http import Response
import requests
import sys
from bs4 import BeautifulSoup

class PageItem(scrapy.Item):
    body = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()

class IntranetspiderSpider(CrawlSpider):
    # credentials read by NTLM_Middleware via getattr(spider, ...)
    http_usr = 'DOMAIN\\user'
    http_pass = 'pass'

    name = "intranetspider"
    protocol = 'https://'
    allowed_domains = ['intranet.mydomain.ca']
    start_urls = ['https://intranet.mydomain.ca/']

    rules = (Rule(LinkExtractor(), callback="parse_items", follow=True),)

    def parse_items(self, response):
        self.logger.info('Crawling page %s', response.url)

        item = PageItem()

        soup = BeautifulSoup(response.body)

        # remove script tags and javascript from the content
        [x.extract() for x in soup.findAll('script')]

        item['body'] = soup.get_text(" ", strip=True)
        item['url'] = response.url

        return item
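With the three files in place, the crawl is normally started with the scrapy crawl intranetspider command. As an alternative, here is a small sketch for running it programmatically; the intranet.spiders.intranetspider module path is an assumption based on the project name used in the settings above.

# run_spider.py - optional, roughly equivalent to "scrapy crawl intranetspider"
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# module path is an assumption; adjust to where intranetspider.py actually lives
from intranet.spiders.intranetspider import IntranetspiderSpider

process = CrawlerProcess(get_project_settings())
process.crawl(IntranetspiderSpider)
process.start()  # blocks until the crawl finishes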