Question

I'd like to find all subpages of a certain URL. For example, I have the URL example.com. There might be subpages such as example.com/home, example.com/help and so on. Is it possible to get all of these subpages without knowing their exact names?

I thought I could handle this with a web crawler, but it only crawls pages that are linked from the page itself.

I hope you understand my problem and can help me with it.

Thank you!

Answer 1

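Yes, to answer your question. Scrapy's "crawl" spider works by setting rules, and those rules can be set up to do exactly what you are trying to do. When in doubt, always go to the docs!

Side note: you can create a crawl spider the same way you create a normal spider!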

scrapy genspider -t crawl nameOfSpider website.com

Then, with the crawl spider, you have to set the rules that basically tell Scrapy where to go and where not to go; how's your regex?!

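from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']  # PART 1: Domain Restriction
    start_urls = ['http://www.example.com']

    rules = (
        # PART 2: Call Back; follow=True keeps following links on matched pages
        Rule(LinkExtractor(allow=(r'.*',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}  # record every page that was reached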

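Now, I copied and pasted this from the "official" docs and altered how it looks, but I didn't check the code, so yeah... The logic is there though.

It works by getting all the links it can see (depending on the rules you set) and doing something with each link.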
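
To make that step concrete, here is a minimal, standard-library-only sketch of link discovery. It is not Scrapy's implementation, just an illustration of "collect the links you can see, keep the ones on the allowed domain that match a pattern". The names LinkCollector and discover_links and the sample HTML are made up for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_links(html, base_url, allowed_domain, allow=r".*"):
    """Return sorted absolute URLs linked from `html` that stay on
    `allowed_domain` (or a subdomain) and match the `allow` regex."""
    parser = LinkCollector()
    parser.feed(html)
    found = set()
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        host = urlparse(url).hostname or ""
        on_domain = host == allowed_domain or host.endswith("." + allowed_domain)
        if on_domain and re.search(allow, url):
            found.add(url)
    return sorted(found)


# Hypothetical page content, for illustration only.
SAMPLE_HTML = """
<a href="/home">Home</a>
<a href="help">Help</a>
<a href="http://other.org/page">Elsewhere</a>
"""

links = discover_links(SAMPLE_HTML, "http://www.example.com/", "example.com")
print(links)
```

A crawler repeats this step on every newly discovered page, which is also why a page that nothing links to is never reached this way.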

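You want to exclude every other domain and restrict the crawl to the one you are interested in (PART 1). In that example, I set a wildcard to literally accept any page in the domain... Once you have figured out the structure of the website, you can use logic to build what you need.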
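
As a sketch of what that logic might look like once the structure is known: the wildcard can be replaced by narrower regular expressions. The URL layout below (/help/, /login, PDF files) is purely hypothetical, and is_wanted is a helper invented for this example; patterns like these could be passed to LinkExtractor through its allow and deny arguments.

```python
import re

# Hypothetical patterns for an imagined site layout.
ALLOW = (r"/help/",)            # only keep pages under /help/
DENY = (r"/login", r"\.pdf$")   # skip login pages and PDF files


def is_wanted(url, allow=ALLOW, deny=DENY):
    """True if `url` matches at least one allow pattern and no deny pattern."""
    if any(re.search(p, url) for p in deny):
        return False
    return any(re.search(p, url) for p in allow)


urls = [
    "http://www.example.com/help/faq",
    "http://www.example.com/help/manual.pdf",
    "http://www.example.com/login",
    "http://www.example.com/home",
]
wanted = [u for u in urls if is_wanted(u)]
print(wanted)
```

Deny patterns are checked first here, so a URL that matches both lists is dropped.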

You should look at the docs more often. I have been using Scrapy for about 6-7 years and I still find myself going back to the manual pages!

Answer 2

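No, you can't.

The way you describe the situation, the website keeps those URLs secret.

Any method of finding such URLs would be a security vulnerability and should be reported to the website owner immediately so they can fix it.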

How to get subpages of an URL without knowing them?
