Question

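I want to crawl some sites that are hosted on medium.com under custom domains (for example, https://uber-developers.news/).

These sites always redirect to medium.com and then back to the original site. The problem is that the medium.com redirect URL is disallowed by medium.com's robots.txt.

This is the redirect chain: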

https://uber-developers.news/
https://medium.com/m/global-identity?redirectUrl=https://uber-developers.news/
https://uber-developers.news/?gi=e0f8caa9844c

The problem is the second URL, "https://medium.com/m/global-identity?redirectUrl=https://uber-developers.news/", which robots.txt disallows:

User-Agent: *
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/*/edit
Allow: /_/
Allow: /_/api/users/*/meta
Allow: /_/api/users/*/profile/stream
Allow: /_/api/posts/*/responses
Allow: /_/api/posts/*/responsesStream
Allow: /_/api/posts/*/related
Sitemap: https://medium.com/sitemap/sitemap.xml
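
To make the conflict concrete, here is a minimal sketch that feeds the rules above to Python's standard-library urllib.robotparser and asks whether the intermediate redirect URL may be fetched. The helper name is_allowed and the default "*" user agent are illustrative assumptions, not part of the original question. The stdlib parser matches rules by plain path prefix, which is enough for the "Disallow: /m/" rule; the wildcard rules may not be interpreted the way a search-engine crawler would interpret them.

```python
from urllib.robotparser import RobotFileParser

# The medium.com rules quoted in the question, embedded so no network is needed.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/*/edit
Allow: /_/
Allow: /_/api/users/*/meta
Allow: /_/api/users/*/profile/stream
Allow: /_/api/posts/*/responses
Allow: /_/api/posts/*/responsesStream
Allow: /_/api/posts/*/related
Sitemap: https://medium.com/sitemap/sitemap.xml
"""


def is_allowed(url, user_agent="*"):
    """Hypothetical helper: check a medium.com URL against the rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


redirect_url = (
    "https://medium.com/m/global-identity"
    "?redirectUrl=https://uber-developers.news/"
)
print(is_allowed(redirect_url))  # False: the path starts with /m/
```

These rules belong to medium.com only. The first and third URLs in the chain live on uber-developers.news, so a crawler that honors robots.txt would check that host's own file for them.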

Should I take robots.txt into account for the second URL?

Thanks for reading.

Answer 1

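The robots.txt file only tells crawlers what they should do; it never actually prevents a crawler from doing something else. What Medium has in place will only stop polite, respectful crawlers.

You need to follow the redirects (for example, cURL has an option for this), and you will arrive at the page you want. But if you do this at scale, Medium may not be happy about it.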
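
With cURL, the option in question is -L (--location). As a rough sketch of what following redirects means for a crawler, the snippet below walks a redirect chain hop by hop so each intermediate URL can be inspected. The names trace_redirects and fake_fetch are hypothetical, and the canned 302 status codes are assumptions made for illustration; only the three URLs come from the question. No network access is involved.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def trace_redirects(start_url, fetch, max_hops=10):
    """Return every URL visited while following redirects from start_url.

    `fetch` is a hypothetical callable: fetch(url) -> (status, location or None).
    """
    chain = [start_url]
    url = start_url
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            break  # final page reached, or nothing left to follow
        url = urljoin(url, location)  # Location headers may be relative
        chain.append(url)
    return chain


# Canned responses mirroring the chain in the question.
# The 302 status codes are assumed, not taken from the question.
CANNED = {
    "https://uber-developers.news/": (
        302,
        "https://medium.com/m/global-identity"
        "?redirectUrl=https://uber-developers.news/",
    ),
    "https://medium.com/m/global-identity"
    "?redirectUrl=https://uber-developers.news/": (
        302,
        "https://uber-developers.news/?gi=e0f8caa9844c",
    ),
    "https://uber-developers.news/?gi=e0f8caa9844c": (200, None),
}


def fake_fetch(url):
    """Stand-in for a real HTTP request with automatic redirects disabled."""
    return CANNED[url]


chain = trace_redirects("https://uber-developers.news/", fake_fetch)
for hop in chain:
    print(hop)
```

In a real crawler, fake_fetch would be replaced by an HTTP client call that does not follow redirects on its own, and a request delay would help keep the load on Medium low.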

Should I take robots.txt into account when a URL is redirected to another domain?
