How to write a custom link extractor in scrapy python

Posted: 2012-12-11 07:13:06

Tags: python scrapy

I want to write a custom Scrapy link extractor for extracting links.

The Scrapy documentation says it has two built-in extractors:

http://doc.scrapy.org/en/latest/topics/link-extractors.html

But I haven't seen any code example showing how to implement a custom link extractor. Can someone give an example of writing one?

3 Answers:

Answer 0 (score: 6)

Here is an example of a custom link extractor:

import re
from urlparse import urljoin  # Python 2, matching the era of this answer
from scrapy.link import Link
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from w3lib.html import remove_tags, remove_entities, replace_escape_chars

# Module-level helpers the extractor relies on: a regex capturing
# (href value, remaining attributes, link text) per <a> tag, and a raw-href
# cleaner. The exact pattern in the original project may differ slightly.
linkre = re.compile(r"<a\s[^>]*href\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s>]+)([^>]*)>(.*?)</a>",
                    re.DOTALL | re.IGNORECASE)
clean_link = lambda u: u.strip("\t\r\n '\"")

class RCP_RegexLinkExtractor(SgmlLinkExtractor):
    """High performance link extractor"""

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        # Resolve each href against the base URL and normalise the link text
        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

        return [Link(url, text) for url, text in urlstext]

Usage:

rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
            # Regex explanation:
            #     [a-z]{2} - matches a two character state abbreviation
            #     [a-z]+   - matches a state name
            #     [0-9]{4} - matches a 4 number unique webpage identifier

            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None, # default 
        process_links='processLinks',
        process_request='processRequest',
    ),
)

Have a look here: https://github.com/jtfairbank/RCP-Poll-Scraper
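
For context, here is a rough sketch of how those rules could sit inside a CrawlSpider. The spider name, start URL, and callback bodies below are illustrative placeholders, not taken from the linked repository, and it assumes the RCP_RegexLinkExtractor class defined above is in scope:

from scrapy.contrib.spiders import CrawlSpider, Rule  # old-style import, Scrapy 0.x

class RCPPollSpider(CrawlSpider):
    name = 'rcp_polls'                                  # placeholder spider name
    allowed_domains = ['realclearpolitics.com']
    start_urls = ['http://www.realclearpolitics.com/epolls/2012/president/']  # placeholder

    rules = (
        Rule(
            RCP_RegexLinkExtractor(
                allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
                allow_domains=('realclearpolitics.com',),
            ),
            callback='parseStatePolls',
            process_links='processLinks',
            process_request='processRequest',
        ),
    )

    def parseStatePolls(self, response):
        # parse the individual poll page here
        pass

    def processLinks(self, links):
        # hook to filter or rewrite the extracted links
        return links

    def processRequest(self, request):
        # hook to tweak each outgoing request
        return request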

Answer 1 (score: 2)

I had a hard time finding recent examples, so I decided to post the process I went through to write a custom link extractor.

Why I decided to create a custom link extractor

I had problems crawling a website whose href URLs contained spaces, tabs and line breaks, like this:

<a href="
       /something/something.html
         " />

Let's say the page containing this link is located at:

http://example.com/something/page.html

Instead of converting this href URL to:

http://example.com/something/something.html

Scrapy converted it into:

http://example.com/something%0A%20%20%20%20%20%20%20/something/something.html%0A%20%20%20%20%20%20%20

This caused an infinite loop, as the crawler followed those badly interpreted URLs deeper and deeper.
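
To see the effect in isolation, here is a minimal Python 2 sketch mirroring the URLs from the example above; only the stdlib urljoin is used:

from urlparse import urljoin  # urllib.parse.urljoin on Python 3

base = 'http://example.com/something/page.html'
href = '\n       /something/something.html\n         '   # href value as found in the page

# Joined as-is, the whitespace stays inside the resulting URL and is later
# percent-encoded (%0A, %20), producing the broken nested links shown above
print repr(urljoin(base, href))
# 'http://example.com/something/\n       /something/something.html\n         '

# Stripping the value first gives the URL we actually want
print repr(urljoin(base, href.strip()))
# 'http://example.com/something/something.html'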

I tried using the process_value and process_links parameters of the link extractor, as suggested here, with no luck, so I decided to patch the method that handles relative URLs.

Finding the original code

In the current version of Scrapy (1.0.3), the recommended link extractor is LxmlLinkExtractor.

If you want to extend LxmlLinkExtractor, you should check how the code works in the Scrapy version you are using.

You can open the location of your currently installed Scrapy code by running this from the command line (on OS X):

open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')

In the version I use (1.0.3), the code of LxmlLinkExtractor is located in:

scrapy/linkextractors/lxmlhtml.py
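
A platform-independent alternative (a convenience, not from the original answer) is to ask Python where the installed package lives:

import os, scrapy
# Print the directory of the installed scrapy package (Python 2 print syntax)
print os.path.dirname(scrapy.__file__)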

There, I saw that the method I needed to adapt was _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.

So I extended the LxmlLinkExtractor and LxmlParserLinkExtractor classes with slightly modified ones called CustomLinkExtractor and CustomParserLinkExtractor. The single line I modified is commented out.

# Import everything from the original lxmlhtml
from scrapy.linkextractors.lxmlhtml import *

_collect_string_content = etree.XPath("string()")

# Extend LxmlParserLinkExtractor
class CustomParserLinkExtractor(LxmlParserLinkExtractor):

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []

        for el, attr, attr_val in self._iter_links(selector._root):

            # Original method was:
            # attr_val = urljoin(base_url, attr_val)
            # So I just added a .strip()
            attr_val = urljoin(base_url, attr_val.strip())

            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)

            # to fix relative links after process_value
            url = urljoin(response_url, url)

            link = Link(url, _collect_string_content(el) or u'',
                        nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)

        return unique_list(links, key=lambda link: link.url) \
            if self.unique else links


# Extend LxmlLinkExtractor
class CustomLinkExtractor(LxmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs

        # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
        lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)

And when defining the rules, I use CustomLinkExtractor.
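
As a rough sketch of that last step (the spider name, start URL, allow pattern and callback are placeholders, not from the original answer), assuming the CustomLinkExtractor defined above is in scope:

from scrapy.spiders import CrawlSpider, Rule  # Scrapy 1.0.x import path

class MySpider(CrawlSpider):
    name = 'my_spider'                                        # placeholder spider name
    start_urls = ['http://example.com/something/page.html']   # placeholder start URL

    rules = (
        Rule(
            CustomLinkExtractor(allow=(r'/something/.+\.html',)),  # placeholder pattern
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        # handle pages reached through the whitespace-tolerant extractor
        pass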

Answer 2 (score: 0)