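I want to use Scrapy's LinkExtractor() to follow only links in the .th domain.
I see there is a deny_extensions (list) parameter, but no allow_extensions() parameter.
Given that, how can I restrict the extracted links to .th domains only?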
Answer 0 (score: 0)
deny_extensions filters out URLs by file extension, for example those ending in .gz or .exe. It does not look at the domain.
You are probably looking for allow_domains:
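allow_domains (str or list) - a single value or a list of strings containing the domains that will be considered for extracting links
deny_domains (str or list) - a single value or a list of strings containing the domains that won't be considered for extracting links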
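To illustrate why allow_domains may already be enough here: as far as I can tell, Scrapy treats a host as belonging to a listed domain when it equals that domain or ends with "." plus that domain, so passing the bare TLD (allow_domains=['th']) should keep only .th hosts. The sketch below reproduces that suffix rule with the standard library only. host_in_domains is a hypothetical helper written for this illustration, not a Scrapy function, and the URLs are made up; it is worth verifying the behavior against your Scrapy version.

```python
from urllib.parse import urlparse


def host_in_domains(url, domains):
    """Return True if the URL's host equals, or is a subdomain of, any listed domain.

    Hypothetical helper: mirrors the suffix rule assumed in the text above.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d.lower() or host.endswith("." + d.lower()) for d in domains)


# Made-up URLs for illustration only.
urls = [
    "http://www.example.co.th/page",
    "http://example.th/",
    "http://www.example.com/th",   # "th" appears only in the path
    "http://www.bath.example/",    # "th" appears inside a label, not as the TLD
]

# Keep only URLs whose host falls under the "th" TLD.
th_only = [u for u in urls if host_in_domains(u, ["th"])]
print(th_only)
```

Under that assumption, the Scrapy equivalent would simply be LinkExtractor(allow_domains=['th']).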
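Edit:
Another option, mentioned in my comments, is to use a custom LinkExtractor.
Below is an example of such a link extractor. It does the same thing as the standard link extractor, but additionally filters out links whose domain does not match a Unix filename pattern (it uses the fnmatch module):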
from six.moves.urllib.parse import urlparse
import fnmatch
import re

from scrapy.linkextractors import LinkExtractor


class DomainPatternLinkExtractor(LinkExtractor):

    def __init__(self, domain_pattern, *args, **kwargs):
        super(DomainPatternLinkExtractor, self).__init__(*args, **kwargs)
        # take a Unix file pattern string and translate
        # it to a regular expression to match domains against
        regex = fnmatch.translate(domain_pattern)
        self.reobj = re.compile(regex)

    def extract_links(self, response):
        # keep only the links whose netloc matches the domain pattern
        return list(
            filter(
                lambda link: self.reobj.search(urlparse(link.url).netloc),
                super(DomainPatternLinkExtractor, self).extract_links(response)
            )
        )
In your case, you could use it like this: DomainPatternLinkExtractor('*.th').
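To make the matching rule concrete, here is a small standard-library sketch of what the '*.th' pattern accepts when applied to a URL's netloc, the same check the extractor above performs. netloc_matches is a hypothetical helper and the URLs are invented for illustration. One caveat it shows: netloc includes an explicit port, so a host like www.example.th:8080 would not match; matching against hostname instead is one possible way around that.

```python
import fnmatch
import re
from urllib.parse import urlparse

# Translate the Unix-style pattern into a compiled regular expression,
# as DomainPatternLinkExtractor.__init__ does.
pattern = re.compile(fnmatch.translate("*.th"))


def netloc_matches(url):
    """Hypothetical helper: True if the URL's netloc matches the '*.th' pattern."""
    return bool(pattern.search(urlparse(url).netloc))


# Invented URLs for illustration only.
for url in [
    "http://www.example.co.th/",
    "http://www.example.com/index.th",  # ".th" only in the path
    "http://www.example.th:8080/",      # port is part of netloc
]:
    print(url, netloc_matches(url))

# Matching on hostname (which drops the port) is an alternative, not what the class above does.
with_port = urlparse("http://www.example.th:8080/")
print(bool(pattern.search(with_port.hostname)))
```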
Example scrapy shell session using this link extractor:
$ scrapy shell http://www.dmoz.org/News/Weather/
2016-11-21 17:14:51 [scrapy] INFO: Scrapy 1.2.1 started (bot: issue2401)
(...)
2016-11-21 17:14:52 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/News/Weather/> (referer: None)
>>> from six.moves.urllib.parse import urlparse
>>> import fnmatch
>>> import re
>>>
>>> from scrapy.linkextractors import LinkExtractor
>>>
>>>
>>> class DomainPatternLinkExtractor(LinkExtractor):
...
...     def __init__(self, domain_pattern, *args, **kwargs):
...         super(DomainPatternLinkExtractor, self).__init__(*args, **kwargs)
...         regex = fnmatch.translate(domain_pattern)
...         self.reobj = re.compile(regex)
...     def extract_links(self, response):
...         return list(
...             filter(
...                 lambda link: self.reobj.search(urlparse(link.url).netloc),
...                 super(DomainPatternLinkExtractor, self).extract_links(response)
...             )
...         )
...
>>> from pprint import pprint
>>> pprint([l.url for l in DomainPatternLinkExtractor('*.co.uk').extract_links(response)])
['http://news.bbc.co.uk/weather/',
 'http://freemeteo.co.uk/',
 'http://www.weatheronline.co.uk/']
>>> pprint([l.url for l in DomainPatternLinkExtractor('*.gov*').extract_links(response)])
['http://www.metoffice.gov.uk/', 'http://www.weather.gov/']
>>> pprint([l.url for l in DomainPatternLinkExtractor('*.name').extract_links(response)])
['http://www.accuweather.name/']