I want to write a custom Scrapy link extractor to extract links.
The Scrapy documentation says it has two built-in extractors:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
But I haven't seen any code examples of how to implement a custom link extractor. Can someone give an example of writing one?
Answer 0 (score: 6)
Here is an example of a custom link extractor:
    # Python 2 / old Scrapy imports (SgmlLinkExtractor era, removed in later versions)
    from urlparse import urljoin

    from scrapy.link import Link
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from w3lib.html import remove_tags, remove_entities, replace_escape_chars

    class RCP_RegexLinkExtractor(SgmlLinkExtractor):
        """High performant link extractor"""

        def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
            if base_url is None:
                base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

            clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
            clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

            links_text = linkre.findall(response_text)
            urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

            return [Link(url, text) for url, text in urlstext]
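Note that this snippet depends on two helpers, linkre and clean_link, that the answer does not show. A minimal sketch of plausible definitions; these exact patterns are assumptions, not part of the original answer:

    import re

    # Assumed: a regex capturing (url, rest-of-tag, anchor text) for each <a> tag,
    # matching the "for url, _, text in links_text" unpacking above
    linkre = re.compile(
        r'<a\s.*?href=("[^"]+?"|\'[^\']+?\'|[^\s>]+)(>|\s.*?>)(.*?)</a>',
        re.DOTALL | re.IGNORECASE)

    def clean_link(link_text):
        """Assumed: strip surrounding whitespace and quotes from a raw href value."""
        return link_text.strip("\t\r\n '\"")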
Usage:
    rules = (
        Rule(
            RCP_RegexLinkExtractor(
                allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
                # Regex explanation:
                #   [a-z]{2} - matches a two-character state abbreviation
                #   [a-z]+   - matches a state name
                #   [0-9]{4} - matches a 4-digit unique webpage identifier
                allow_domains=('realclearpolitics.com',),
            ),
            callback='parseStatePolls',
            # follow=None, # default
            process_links='processLinks',
            process_request='processRequest',
        ),
    )
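For context, the callback and the two processing hooks referenced by name in the rule (parseStatePolls, processLinks, processRequest) are methods on the crawling spider. A bare-bones sketch of how they fit together; the spider name, start URL, and method bodies are placeholders, not part of the original answer:

    from scrapy.contrib.spiders import CrawlSpider

    class StatePollsSpider(CrawlSpider):
        name = 'state_polls'
        allowed_domains = ['realclearpolitics.com']
        start_urls = ['http://www.realclearpolitics.com/epolls/2012/president/']  # placeholder

        # rules = ( ... the Rule shown above ... )

        def parseStatePolls(self, response):
            # Parse the state poll page and yield items here.
            pass

        def processLinks(self, links):
            # Inspect or filter the extracted Link objects before they are followed.
            return links

        def processRequest(self, request):
            # Tweak each Request (headers, meta, ...) before it is scheduled.
            return request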
Answer 1 (score: 2)
I had a hard time finding up-to-date examples, so I decided to write up my process for implementing a custom link extractor.
The problem I ran into was crawling a website whose href URLs contain spaces, tabs, and line breaks, like this:
    <a href="
    /something/something.html
    " />
Suppose the page containing this link is at:

    http://example.com/something/page.html

Instead of turning this href into:

    http://example.com/something/something.html

Scrapy turned it into something like this (the exact mangled URL was not preserved in this copy of the answer; the whitespace ends up inside the joined path, percent-encoded when the request is made):

    http://example.com/something/%0A%20%20%20%20/something/something.html%0A

This led to an infinite loop, because the crawler kept following ever deeper variants of these badly interpreted URLs.
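To see why the loop gets deeper on every hop, here is a quick illustration with Python 2's urlparse.urljoin, which link extraction relies on. The leading whitespace hides the '/' that would mark the href as absolute, so it gets joined as a relative path (output shown roughly):

    >>> from urlparse import urljoin
    >>> urljoin('http://example.com/something/page.html',
    ...         '\n    /something/something.html\n    ')
    'http://example.com/something/\n    /something/something.html\n    '

Each page crawled at such a URL yields an even longer join, hence the endless descent.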
I tried to use the process_value argument of the link extractor and the process_links argument of Rule, as suggested here, without luck, so I decided to patch the method that handles relative URLs.
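The failed attempt presumably looked something like the sketch below (an assumption, not the original code). It cannot work because, as the patched method further down shows, urljoin() runs on the raw attribute value before process_value is applied, so the whitespace is already baked into the joined URL:

    from scrapy.linkextractors import LxmlLinkExtractor

    # Too late: process_value sees the attribute only after it has already
    # been joined against the base URL, whitespace included.
    link_extractor = LxmlLinkExtractor(process_value=lambda value: value.strip())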
In the current version of Scrapy (1.0.3), the recommended link extractor is LxmlLinkExtractor.
If you want to extend LxmlLinkExtractor, you should check how the code works in the Scrapy version you are using.
You can open the location of your currently used Scrapy code by running this from the command line (on OS X):

    open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')

In the version I use (1.0.3), the code of LxmlLinkExtractor is in:

    scrapy/linkextractors/lxmlhtml.py

There I saw that the method I needed to adapt is _extract_links() in LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.
So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The single line I modified is commented out.
    # Import everything from the original lxmlhtml
    from scrapy.linkextractors.lxmlhtml import *

    _collect_string_content = etree.XPath("string()")

    # Extend LxmlParserLinkExtractor
    class CustomParserLinkExtractor(LxmlParserLinkExtractor):

        def _extract_links(self, selector, response_url, response_encoding, base_url):
            links = []

            for el, attr, attr_val in self._iter_links(selector._root):

                # Original method was:
                # attr_val = urljoin(base_url, attr_val)
                # So I just added a .strip()
                attr_val = urljoin(base_url, attr_val.strip())

                url = self.process_attr(attr_val)
                if url is None:
                    continue
                if isinstance(url, unicode):
                    url = url.encode(response_encoding)

                # to fix relative links after process_value
                url = urljoin(response_url, url)

                link = Link(url, _collect_string_content(el) or u'',
                            nofollow=True if el.get('rel') == 'nofollow' else False)
                links.append(link)

            return unique_list(links, key=lambda link: link.url) \
                if self.unique else links

    # Extend LxmlLinkExtractor
    class CustomLinkExtractor(LxmlLinkExtractor):

        def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                     tags=('a', 'area'), attrs=('href',), canonicalize=True,
                     unique=True, process_value=None, deny_extensions=None, restrict_css=()):
            tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
            tag_func = lambda x: x in tags
            attr_func = lambda x: x in attrs

            # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
            lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
                                           unique=unique, process=process_value)

            super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
                                                    allow_domains=allow_domains, deny_domains=deny_domains,
                                                    restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
                                                    canonicalize=canonicalize, deny_extensions=deny_extensions)
And when defining the rules, I use CustomLinkExtractor:
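The rule definition itself was lost from this copy of the answer; a minimal sketch of what it looks like, where the allow pattern and callback name are placeholders:

    rules = (
        Rule(
            CustomLinkExtractor(allow=(r'/something/.+\.html',)),
            callback='parse_page',
            follow=True,
        ),
    )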
Answer 2 (score: 0)
I also found LinkExtractor examples at https://github.com/geekan/scrapy-examples and https://github.com/mjhea0/Scrapy-Samples
(edited after people could not find the needed information at the links above)
More precisely: https://github.com/geekan/scrapy-examples/search?utf8=%E2%9C%93&q=linkextractors&type=Code and https://github.com/mjhea0/Scrapy-Samples/search?utf8=%E2%9C%93&q=linkextractors