Question

I'm scraping data from a website and need to iterate over its pages, but the pages are indexed alphabetically rather than by a counter:

    http://funny2.com/jokesb.htm
    http://funny2.com/jokesc.htm
    ...

But I can't figure out how to include an [a-z] iterator. I tried

    http://funny2.com/jokes^[a-z]+$.htm

which doesn't work.

Answer 1

You can loop over every letter of the alphabet and format that letter into a URL template:

from string import ascii_lowercase

# 'abcdefghijklmnopqrstuvwxyz'
for char in ascii_lowercase:
    url = "http://funny2.com/jokes{}.htm".format(char)
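
If you would rather collect every page URL up front, the same idea can be wrapped in a small helper. This is only a minimal sketch: the name `letter_urls` is an illustrative choice, not part of Scrapy, and it assumes a page exists for every letter from a to z.

```python
from string import ascii_lowercase


def letter_urls(template="http://funny2.com/jokes{}.htm"):
    """Return one URL per lowercase letter, in alphabetical order."""
    return [template.format(char) for char in ascii_lowercase]


urls = letter_urls()
```

In a Scrapy spider, a list like this could be assigned to `start_urls`, in which case no next-page logic is needed at all.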

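In a Scrapy context, you need a way to advance the letter in the URL. You can use a regular expression to find the current letter, work out the next letter of the alphabet, and substitute it into the current URL, for example: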

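import re
from string import ascii_lowercase

from scrapy import Request

def parse(self, response):
    # Pull the single index letter out of the current URL, e.g. 'b' from jokesb.htm
    match = re.search(r'jokes([a-z])\.htm', response.url)
    if match is None:
        return
    # Position of the next letter in the alphabet
    next_index = ascii_lowercase.index(match.group(1)) + 1
    if next_index < len(ascii_lowercase):  # stop after 'z'
        next_char = ascii_lowercase[next_index]
        next_url = re.sub(r'jokes[a-z]\.htm', 'jokes{}.htm'.format(next_char), response.url)
        yield Request(next_url, self.parse2)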
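
The URL arithmetic in that callback can be checked without running a spider by pulling it into a plain function. The sketch below assumes the URL contains exactly one `jokes<letter>.htm` segment with a lowercase letter; `next_letter_url` is a hypothetical helper name, not a Scrapy API.

```python
import re
from string import ascii_lowercase


def next_letter_url(url):
    """Return the URL for the next letter, or None if there is no next page."""
    match = re.search(r"jokes([a-z])\.htm", url)
    if match is None:
        return None  # URL does not follow the jokes<letter>.htm pattern
    next_index = ascii_lowercase.index(match.group(1)) + 1
    if next_index >= len(ascii_lowercase):
        return None  # already at 'z', nothing left to request
    next_char = ascii_lowercase[next_index]
    # Replace only the captured letter, leaving the rest of the URL intact
    return url[:match.start(1)] + next_char + url[match.end(1):]
```

Inside `parse`, you would yield a request only when this returns something other than `None`.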

Answer 2

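XPath 1.0, which is what Scrapy's selectors use, does not support regular expressions. However, because Scrapy is built on top of lxml, it supports some EXSLT extensions, in particular the re extension. You can use EXSLT operations by prefixing them with the corresponding namespace, like this: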

response.xpath(r'//a[re:test(@href, "jokes[a-z]+\.htm")]/@href')

Docs: https://doc.scrapy.org/en/latest/topics/selectors.html?highlight=selector#using-exslt-extensions

If you only need to extract the links, use LinkExtractor with a regexp:

LinkExtractor(allow=r'/jokes[a-z]+\.htm').extract_links(response)
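
Before putting the pattern into a spider, it can help to check what it accepts. The sketch below uses Python's `re.search` as a stand-in, on the assumption that both `re:test` and `allow` perform an unanchored search; the sample hrefs are made up for illustration and are not taken from the real site.

```python
import re

# Same pattern as in the XPath example (without the leading slash)
PATTERN = re.compile(r"jokes[a-z]+\.htm")

# Hypothetical hrefs, invented for this check
hrefs = ["jokesa.htm", "/jokesb.htm", "jokes12.htm", "jokesabc.htm", "index.htm"]

# Keep only the hrefs in which the pattern is found somewhere
matching = [href for href in hrefs if PATTERN.search(href)]
```

Because the pattern uses `[a-z]+`, multi-letter names such as `jokesabc.htm` also match. Dropping the `+` restricts it to a single index letter.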

How to generate XPath links using characters in [a-z]
