Question

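Is it possible, in an SgmlLinkExtractor rule, to allow only a limited number of directories (say 3) between /static/ and /otherstuff/? In the example below, EX1 would not be crawled (because there are four directories between /static/ and /otherstuff/), but EX2 would be.

EX1: http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap
EX2: http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap

Assume /static/ and /otherstuff/ always sit on either side of the directories I care about.

Thanks a TON for any help!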

Answer 1

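You can use a regular expression in the allow parameter, or a test function in the process_value parameter. (See the docs.)

Both have their pros and cons, depending on how the links appear in the page. If you use a regular expression, it is tested against the fully qualified URL (e.g. http://domain.com/foo/bar). If you use the process_value parameter, you get the raw value found in the page (e.g. /foo/bar or, worse, a relative link).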
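To make the difference concrete, here is a small standard-library sketch of how raw href values relate to the absolute URLs a regular expression would be tested against. The page URL and the href values are made up for illustration, and urljoin is used only as an approximation of the resolution step, not as a description of Scrapy's internals.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links were found
page_url = "http://domain.com/foo/index.html"

# Raw values as they might appear in href attributes (what process_value sees)
raw_values = ["/foo/bar", "bar/baz", "../up"]

# Fully qualified URLs (the form an allow regex would be matched against)
absolute = [urljoin(page_url, v) for v in raw_values]

for raw, full in zip(raw_values, absolute):
    print(raw, "->", full)
```

Counting slashes on the raw value and on the absolute URL therefore gives different answers for the same link, which is why the two approaches need different checks.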

For example, the regular expression domain.com/(?:\w+/){1,3}\w+$ matches

domain.com/foo/bar
domain.com/foo/bar/foo
domain.com/foo/bar/foo/bar

but not

domain.com/foo/
domain.com/foo/bar/foo/bar/foo
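These cases can be checked with Python's re module. This is a minimal sketch that assumes the pattern is applied with search semantics to the absolute URL; the http:// prefix is added to the sample URLs for illustration, and the dot is escaped so it only matches a literal ".".

```python
import re

# Pattern from the answer, with the dot escaped
PATTERN = re.compile(r"domain\.com/(?:\w+/){1,3}\w+$")

urls = [
    "http://domain.com/foo/bar",
    "http://domain.com/foo/bar/foo",
    "http://domain.com/foo/bar/foo/bar",
    "http://domain.com/foo/",
    "http://domain.com/foo/bar/foo/bar/foo",
]

# Keep only the URLs the pattern accepts
matched = [u for u in urls if PATTERN.search(u)]

for u in urls:
    print(u, "->", "match" if u in matched else "no match")
```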

If you use process_value, a function like this would do:

def filter_path(value):
    # at least 2, at most 3 /'s
    if 1 < value.count('/') < 4:
        return value

The function above assumes the links in your HTML have href values like /foo, /foo/bar/foo, and so on.
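As a quick check of which values such a function keeps, here is a self-contained sketch that runs it over a few made-up href values. The explicit return None is added only for readability; a process_value function that returns None tells the link extractor to ignore that link.

```python
def filter_path(value):
    # Keep values containing at least 2 and at most 3 slashes
    if 1 < value.count('/') < 4:
        return value
    return None  # anything else is dropped

# Hypothetical href values
samples = ["/foo", "/foo/bar", "/foo/bar/foo", "/foo/bar/foo/bar"]

kept = [s for s in samples if filter_path(s) is not None]
print(kept)
```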

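In your specific case, the regular expression would be something like domain.com/static/(?:\w+/){3}otherstuff for exactly three intermediate directories, or with {1,3} in place of {3} to allow up to three (which is what lets EX2 through). The filter_path function could instead check value.startswith('/static/') and the suffix.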
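Both variants can be sketched against the two URLs from the question. The assumptions here: between one and three intermediate directories are allowed, /otherstuff/ is followed by a slash, and filter_static_path is a hypothetical helper name for the process_value variant, which expects root-relative href values.

```python
import re

# allow-style pattern: 1 to 3 directories between /static/ and /otherstuff/
STATIC_RE = re.compile(r"domain\.com/static/(?:\w+/){1,3}otherstuff/")

EX1 = "http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap"
EX2 = "http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap"


def filter_static_path(value, max_dirs=3):
    """process_value-style filter for hrefs such as /static/d1/otherstuff/x."""
    parts = value.split("/")  # a leading "/" yields an empty first element
    if len(parts) < 3 or parts[0] != "" or parts[1] != "static":
        return None
    rest = parts[2:]
    if "otherstuff" not in rest:
        return None
    # Index of the first "otherstuff" equals the number of directories before it
    between = rest.index("otherstuff")
    return value if 1 <= between <= max_dirs else None


print(bool(STATIC_RE.search(EX1)), bool(STATIC_RE.search(EX2)))
print(filter_static_path("/static/d1/d2/otherstuff/otherstuff2/bunchacrap"))
```

Change the lower bound to 0 if /static/otherstuff/ with nothing in between should also be accepted.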

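If you are using the CrawlSpider class with Rule, there is a third option. The process_links parameter lets you pass a function that processes the list of links. For example: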

def url_allowed(url):
    # check for the pattern /static/dir/dir/dir/ etc
    return True

def process_links(links):
    return [l for l in links if url_allowed(l.url)]
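One possible way to fill in url_allowed is sketched below, assuming the same limit of one to three intermediate directories. The Link namedtuple is only a stand-in so the example runs without Scrapy; the one thing it relies on is that each link object exposes a .url attribute.

```python
import re
from collections import namedtuple
from urllib.parse import urlparse

# Stand-in for the link objects passed to process_links (only .url is used)
Link = namedtuple("Link", ["url"])

# /static/, then 1 to 3 directories, then /otherstuff/
PATH_RE = re.compile(r"^/static/(?:[^/]+/){1,3}otherstuff/")


def url_allowed(url):
    # Match against the path only, so scheme and host do not matter
    return PATH_RE.match(urlparse(url).path) is not None


def process_links(links):
    return [l for l in links if url_allowed(l.url)]


links = [
    Link("http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap"),
    Link("http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap"),
]
kept = process_links(links)
print([l.url for l in kept])
```

Working on the parsed path rather than the whole URL keeps the check independent of the domain, at the cost of one extra parsing step per link.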

Scrapy - Limiting intermediate directories (Python)
