Question

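I want to get a list of all URLs on the internet that match a regular expression.

E.g., I want to know all unique page URLs under site:jobs.lever.co that return 200 OK. For example, https://jobs.lever.co/reddit is good, while https://jobs.lever.co/reddit?utm_source=fff and https://jobs.lever.co/bl4hbl4h are not.

Any tips/tricks for getting this data? Thanks!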

Answer 1

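If you're looking for a solid crawler (in Python) and are willing to spend some time fine-tuning it, take a look at Scrapy. It uses a combination of XPath and CSS queries to select elements. A possible spider could look like the code below: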

import scrapy


class leverSpider(scrapy.Spider):
    name = 'lever'
    start_urls = (
        'https://jobs.lever.co/reddit',
    )

    def parse(self, response):
        # Collect the href of every link with the class "posting-title".
        for link in response.xpath("//a[@class='posting-title']/@href").getall():
            # response.follow also resolves relative hrefs against the page URL.
            yield response.follow(link, self.parse_item)

    def parse_item(self, response):
        # Do something useful with the page here. As a placeholder,
        # emit the URL of each page that was fetched.
        yield {'url': response.url}

You can start it with scrapy crawl lever; it looks for links with the class posting-title and requests those pages.
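The spider collects links, but the question also asks for URLs that match a regular expression and are unique, so that tracking variants such as ?utm_source=fff do not count twice. A minimal sketch of that filtering step, using only the standard library, might look like this. The names COMPANY_PAGE, canonical and unique_matches are hypothetical, and the pattern is only an assumption about what a valid page URL looks like:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical pattern (an assumption, not taken from the question):
# exactly one path segment under jobs.lever.co.
COMPANY_PAGE = re.compile(r"https://jobs\.lever\.co/[^/?#]+")


def canonical(url):
    """Drop the query string, fragment and trailing slash from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


def unique_matches(urls, pattern=COMPANY_PAGE):
    """Return canonical URLs that fully match the pattern, in first-seen order."""
    seen = set()
    result = []
    for url in urls:
        clean = canonical(url)
        if pattern.fullmatch(clean) and clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result


if __name__ == "__main__":
    candidates = [
        "https://jobs.lever.co/reddit",
        "https://jobs.lever.co/reddit?utm_source=fff",
        "https://example.com/reddit",
    ]
    print(unique_matches(candidates))
```

Inside the spider, parse_item could pass response.url through canonical() before yielding it. The regular expression alone cannot tell that https://jobs.lever.co/bl4hbl4h is a dead page; that still requires a request. With Scrapy's default settings, responses outside the 2xx range are not passed to spider callbacks, so such a page would normally never reach parse_item, assuming the server really answers it with an error status.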

