Denying certain URLs in Scrapy

Date: 2015-01-19 16:09:08

Tags: python web-scraping scrapy screen-scraping

I have the following code in my Scrapy project:

rules = [
    Rule(LinkExtractor(allow="/uniprot/[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"),
        callback="parsethings", follow=False),
    Rule(LinkExtractor(deny_domains=["help", "category", "citations", "taxonomy","diseases", "locations", "docs", "uniref", "proteomes"])),
    Rule(LinkExtractor(deny_domains=[".fasta","?version","?query","?"])),
]

I'm trying to scrape uniprot (www.uniprot.org) for the names and lengths of genes/proteins.

The first and last rules are meant to block the 10,000 copies of gene pages that carry ".fasta" or a version revision number, but I can't seem to block the URLs under "/help", "/category", and so on.

Basically, I only want to crawl URLs under uniprot.org/uniprot. If I set allowed_domains to "http://www.uniprot.org/uniprot/", the spider actually blocks "www.uniprot.org/uniprot/Q6GZX3" and then dies.
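(For context: allowed_domains is meant to hold bare domain names without a scheme or path, and restricting the crawl to a path is done through the link-extractor regexes instead. A minimal sketch of that arrangement, where the class name, start URL and callback are placeholders and Scrapy >= 1.0 import paths are assumed:)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class UniprotSpider(CrawlSpider):            # placeholder spider name
    name = "uniprot"
    allowed_domains = ["uniprot.org"]        # bare domain only, no "http://" or "/uniprot/"
    start_urls = ["http://www.uniprot.org/uniprot/"]

    rules = [
        # keep the crawl inside the /uniprot/ path via the allow regex
        Rule(LinkExtractor(allow=r"/uniprot/"), callback="parsethings", follow=True),
    ]

    def parsethings(self, response):
        # extraction of the gene/protein name and length would go here
        pass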

How can I get scrapy to crawl only the URLs under /uniprot?

3 Answers:

Answer 0 (score: 0)

Merge the second and third rules into one:

rules = [
    Rule(LinkExtractor(allow="/uniprot/[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"),
        callback="parsethings", follow=False),
    Rule(LinkExtractor(deny_domains=["help", "category", "citations", "taxonomy","diseases", "locations", "docs", "uniref", "proteomes", ".fasta", "?version", "?query", "?"])),
]

Answer 1 (score: 0)

Untested, but you should be able to use a single rule:

rules = [
    Rule(LinkExtractor(
            allow="/uniprot/[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}",
            deny_domains=["help", "category", "citations", "taxonomy", "diseases",
                          "locations", "docs", "uniref", "proteomes", ".fasta",
                          "?version", "?query", "?"]),
        callback="parsethings")
]

Edit: removed a duplicated bracket + follow=False

Answer 2 (score: 0)

Use deny instead of deny_domains.
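That is, deny_domains filters by domain name, whereas deny takes regular expressions matched against the full URL, which is what strings like "/help", ".fasta" and "?version" really are. A rough sketch of the rules with that change (the regexes are illustrative, and Scrapy >= 1.0 import paths are assumed):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = [
    # accession pages to parse; the alternation is grouped so that
    # "/uniprot/" applies to both accession formats
    Rule(LinkExtractor(allow=r"/uniprot/([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})"),
         callback="parsethings", follow=False),
    # follow other links under /uniprot/, skipping the unwanted sections;
    # deny holds URL regexes, not domains
    Rule(LinkExtractor(
        allow=r"/uniprot/",
        deny=[r"/help/", r"/category/", r"/citations/", r"/taxonomy/",
              r"/diseases/", r"/locations/", r"/docs/", r"/uniref/",
              r"/proteomes/", r"\.fasta$",
              r"\?"])),   # any query string, which also covers ?version and ?query
]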