Return only specific URLs in Scrapy

Date: 2017-03-27 16:10:41

Tags: python scrapy

I am using Scrapy to crawl URLs from a website. Currently it returns all URLs, but I want it to return only the URLs that contain the word "download". How can I do that?

from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
import scrapy

DOMAIN = 'somedomain.com'
URL = 'http://' + DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)
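As an aside, building absolute links with plain string concatenation (`URL + url`) is fragile: it mishandles protocol-relative hrefs like `//cdn.example.com/x` and dot-relative paths. The standard library's `urljoin` handles these cases; a minimal sketch (the href values below are invented for illustration):

```python
from urllib.parse import urljoin

BASE = 'http://somedomain.com/'  # stands in for the URL constant above

# Hypothetical hrefs of the kinds a page might contain
hrefs = [
    '/files/a.zip',                  # root-relative
    'download/b.zip',                # relative
    'http://other.com/c.zip',        # already absolute
    '//cdn.somedomain.com/d.zip',    # protocol-relative
]

# urljoin resolves each href against the base, leaving absolute URLs alone
for href in hrefs:
    print(urljoin(BASE, href))
```

This replaces the manual `startswith('http://')` check entirely, since `urljoin` leaves already-absolute URLs untouched.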

Edit:

I implemented the suggestions below. It still throws some errors, but at least it now returns only the links containing "download".

from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor


DOMAIN = 'somedomain.com'
URL = 'http://' + DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # The first parse returns all links on the site and feeds them to parse2

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url = URL + url
            yield Request(url, callback=self.parse2)

    # The second parse keeps only the links that contain "download"

    def parse2(self, response):
        le = LinkExtractor(allow="download")
        for link in le.extract_links(response):
            print link.url
            yield Request(url=link.url, callback=self.parse2)
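One plausible source of the remaining errors (an assumption — the traceback isn't shown in the question) is that `//a/@href` also returns non-HTTP hrefs such as `mailto:` or `javascript:` links, which produce requests Scrapy cannot download. A stdlib-only sketch of a scheme filter that could be applied before yielding each `Request`:

```python
from urllib.parse import urlparse

def is_crawlable(url):
    """Keep http/https and relative URLs (empty scheme); drop hrefs
    like mailto: or javascript:, which cannot be fetched."""
    return urlparse(url).scheme in ('http', 'https', '')

# Invented sample hrefs for illustration
links = [
    'http://somedomain.com/download/a.zip',
    'mailto:me@somedomain.com',
    'javascript:void(0)',
    '/download/b.zip',
]
print([l for l in links if is_crawlable(l)])
# → ['http://somedomain.com/download/a.zip', '/download/b.zip']
```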

2 Answers:

Answer 0 (score: 2)

A more Pythonic and cleaner solution is to use LinkExtractor:

from scrapy.linkextractors import LinkExtractor

...

le = LinkExtractor(allow="download")
for link in le.extract_links(response):
    yield Request(url=link.url, callback=self.parse)
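The `allow` and `deny` arguments each take a regular expression (or list of them) that is searched against each extracted absolute URL: `allow` keeps matching links, `deny` drops them. The filtering is roughly equivalent to this stdlib check (the URLs below are invented for illustration):

```python
import re

# Invented sample URLs, standing in for the links a page might yield
urls = [
    'http://somedomain.com/download/file1.zip',
    'http://somedomain.com/about',
    'http://somedomain.com/downloads/file2.zip',
]

pattern = re.compile('download')

# allow="download": keep URLs where the pattern is found anywhere
allowed = [u for u in urls if pattern.search(u)]
# deny="download": keep URLs where the pattern is NOT found
denied = [u for u in urls if not pattern.search(u)]

print(allowed)
# → ['http://somedomain.com/download/file1.zip',
#    'http://somedomain.com/downloads/file2.zip']
print(denied)
# → ['http://somedomain.com/about']
```

Since the question asks for URLs *containing* "download", `allow` is the argument to use here.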

Answer 1 (score: 1)

You are trying to check whether a substring is present in a string.

Example:

>>> string = 'this is a simple string'
>>> 'simple' in string
True
>>> 'zimple' in string
False

So you just need to add an if statement such as:

if 'download' in url:

after the line:

for url in hxs.select('//a/@href').extract():

That is:

for url in hxs.select('//a/@href').extract():
    if 'download' in url:
        if not ( url.startswith('http://') or url.startswith('https://') ):
            url = URL + url 
        print url
        yield Request(url, callback=self.parse)

This way, the code checks whether the link starts with http:// only when the condition 'download' in url returns True.
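Pulled out of the spider, the same filter can be run standalone to see exactly what reaches the `yield` (the sample hrefs below are invented for illustration):

```python
# Invented sample hrefs, standing in for hxs.select('//a/@href').extract()
hrefs = [
    '/download/a.zip',
    'http://somedomain.com/download/b.zip',
    '/contact',
    'https://other.com/download/c.zip',
]
URL = 'http://somedomain.com'

results = []
for url in hrefs:
    if 'download' in url:  # keep only links containing "download"
        if not (url.startswith('http://') or url.startswith('https://')):
            url = URL + url  # make relative links absolute
        results.append(url)

print(results)
# → ['http://somedomain.com/download/a.zip',
#    'http://somedomain.com/download/b.zip',
#    'https://other.com/download/c.zip']
```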