Return only specific URLs in Scrapy

Date: 2017-03-27 16:10:41

Tags: python scrapy

I am using Scrapy to crawl URLs from a website. Currently it returns all URLs, but I want it to return only the URLs that contain the word "download". How can I do that?

from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
import scrapy

DOMAIN = 'somedomain.com'
URL = 'http://' + DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)
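As an aside, building absolute links with plain string concatenation (`URL + url`) is fragile: it mishandles protocol-relative hrefs like `//cdn.example.com/x` and dot-relative paths. The standard library's `urljoin` handles these cases; a minimal sketch (the href values below are invented for illustration):

```python
from urllib.parse import urljoin

BASE = 'http://somedomain.com/'  # stands in for the URL constant above

# Hypothetical hrefs of the kinds a page might contain
hrefs = [
    '/files/a.zip',                  # root-relative
    'download/b.zip',                # relative
    'http://other.com/c.zip',        # already absolute
    '//cdn.somedomain.com/d.zip',    # protocol-relative
]

# urljoin resolves each href against the base, leaving absolute URLs alone
for href in hrefs:
    print(urljoin(BASE, href))
```

This replaces the manual `startswith('http://')` check entirely, since `urljoin` leaves already-absolute URLs untouched.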

Edit:

I implemented the suggestions below. It still throws some errors, but at least it now returns only the links containing "download".

from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor


DOMAIN = 'somedomain.com'
URL = 'http://' + DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # The first parse returns all links on the site and feeds them to parse2

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url = URL + url
            yield Request(url, callback=self.parse2)

    # The second parse keeps only the links that contain "download"

    def parse2(self, response):
        le = LinkExtractor(allow="download")
        for link in le.extract_links(response):
            print link.url
            yield Request(url=link.url, callback=self.parse2)
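One plausible source of the remaining errors (an assumption — the traceback isn't shown in the question) is that `//a/@href` also returns non-HTTP hrefs such as `mailto:` or `javascript:` links, which produce requests Scrapy cannot download. A stdlib-only sketch of a scheme filter that could be applied before yielding each `Request`:

```python
from urllib.parse import urlparse

def is_crawlable(url):
    """Keep http/https and relative URLs (empty scheme); drop hrefs
    like mailto: or javascript:, which cannot be fetched."""
    return urlparse(url).scheme in ('http', 'https', '')

# Invented sample hrefs for illustration
links = [
    'http://somedomain.com/download/a.zip',
    'mailto:me@somedomain.com',
    'javascript:void(0)',
    '/download/b.zip',
]
print([l for l in links if is_crawlable(l)])
# → ['http://somedomain.com/download/a.zip', '/download/b.zip']
```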

2 Answers:

Answer 0 (score: 2)

A more Pythonic and cleaner solution is to use LinkExtractor:

from scrapy.linkextractors import LinkExtractor

...

le = LinkExtractor(allow="download")
for link in le.extract_links(response):
    yield Request(url=link.url, callback=self.parse)
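The `allow` and `deny` arguments each take a regular expression (or list of them) that is searched against each extracted absolute URL: `allow` keeps matching links, `deny` drops them. The filtering is roughly equivalent to this stdlib check (the URLs below are invented for illustration):

```python
import re

# Invented sample URLs, standing in for the links a page might yield
urls = [
    'http://somedomain.com/download/file1.zip',
    'http://somedomain.com/about',
    'http://somedomain.com/downloads/file2.zip',
]

pattern = re.compile('download')

# allow="download": keep URLs where the pattern is found anywhere
allowed = [u for u in urls if pattern.search(u)]
# deny="download": keep URLs where the pattern is NOT found
denied = [u for u in urls if not pattern.search(u)]

print(allowed)
# → ['http://somedomain.com/download/file1.zip',
#    'http://somedomain.com/downloads/file2.zip']
print(denied)
# → ['http://somedomain.com/about']
```

Since the question asks for URLs *containing* "download", `allow` is the argument to use here.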

Answer 1 (score: 1)

You are trying to check whether a substring is present in a string.

Example:

>>> string = 'this is a simple string'
>>> 'simple' in string
True
>>> 'zimple' in string
False

So you just need to add an if statement such as:

if 'download' in url:

after the line:

for url in hxs.select('//a/@href').extract():

That is:

for url in hxs.select('//a/@href').extract():
    if 'download' in url:
        if not ( url.startswith('http://') or url.startswith('https://') ):
            url = URL + url 
        print url
        yield Request(url, callback=self.parse)

This way, the code checks whether the link starts with http:// only when the condition 'download' in url returns True.
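Pulled out of the spider, the same filter can be run standalone to see exactly what reaches the `yield` (the sample hrefs below are invented for illustration):

```python
# Invented sample hrefs, standing in for hxs.select('//a/@href').extract()
hrefs = [
    '/download/a.zip',
    'http://somedomain.com/download/b.zip',
    '/contact',
    'https://other.com/download/c.zip',
]
URL = 'http://somedomain.com'

results = []
for url in hrefs:
    if 'download' in url:  # keep only links containing "download"
        if not (url.startswith('http://') or url.startswith('https://')):
            url = URL + url  # make relative links absolute
        results.append(url)

print(results)
# → ['http://somedomain.com/download/a.zip',
#    'http://somedomain.com/download/b.zip',
#    'https://other.com/download/c.zip']
```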