Question

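My goal is to fetch a URL with PHP, then parse the content with Simple PHP DOM parser (or something similar) and look for H1-H6 tags. Before doing that, though, I need to find out whether the page is allowed to be indexed.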

Besides parsing the content and searching for <meta name="robots" content="noindex"> or something similar, is there a way to check in robots.txt whether the page is set to noindex?

Answer 1

A page can specify noindex in two places: in a meta HTML tag in the <head> section (as you mentioned), or in an HTTP header of the response.

On top of that, there are two values that specify noindex: "noindex" and "none" (which is equivalent to "noindex, nofollow").
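As a rough illustration of that rule, here is a minimal sketch of a helper that decides whether a directive string blocks indexing. It is written in Python rather than PHP purely for illustration, and the function name has_noindex is hypothetical. It assumes directives are comma-separated and case-insensitive.

```python
def has_noindex(directives: str) -> bool:
    """Return True if a comma-separated robots directive string blocks indexing.

    Assumption: values are compared case-insensitively and whole-token,
    so "nonexistent" does not count as "none".
    """
    tokens = {token.strip().lower() for token in directives.split(",")}
    return bool(tokens & {"noindex", "none"})
```

Comparing whole tokens rather than searching for a substring avoids false positives from unrelated values that merely contain "none".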

The HTML tag can target different crawlers by name, and looks like this:

<meta name="robots" content="noindex" />

or

<meta name="googlebot" content="noindex" />

or

<meta name="AdsBot-Google" content="noindex" />

or the name of some other crawler.

So the way to check for noindex is to do both of the following:

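Check whether the HTTP response contains an X-Robots-Tag header with "noindex" or "none" (try curl -I https://www.example.com to see what the headers look like)
Fetch the HTML and scan the meta tags for "noindex" or "none" in the content attribute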
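
The two checks above could be combined roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: it uses Python's standard library rather than PHP, all function and class names are hypothetical, the list of crawler names is only the three shown earlier, and header values that carry a crawler-name prefix are not handled.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

BLOCKING = {"noindex", "none"}
# Assumption: only these meta names are inspected; extend as needed.
CRAWLER_NAMES = {"robots", "googlebot", "adsbot-google"}


def blocks_indexing(directives):
    """True if a comma-separated directive string contains noindex or none."""
    tokens = {t.strip().lower() for t in directives.split(",")}
    return bool(tokens & BLOCKING)


class RobotsMetaParser(HTMLParser):
    """Sets self.noindex when a robots-style meta tag blocks indexing."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in CRAWLER_NAMES and blocks_indexing(attr.get("content", "")):
            self.noindex = True


def meta_noindex(html):
    """Check 2: scan the meta tags of an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


def header_noindex(values):
    """Check 1: inspect a list of X-Robots-Tag header values."""
    return any(blocks_indexing(value) for value in values)


def page_is_noindex(url):
    """Fetch the URL once and apply both checks (raises on HTTP errors)."""
    request = Request(url, headers={"User-Agent": "noindex-check-sketch"})
    with urlopen(request) as response:
        header_values = response.headers.get_all("X-Robots-Tag") or []
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return header_noindex(header_values) or meta_noindex(html)
```

Calling page_is_noindex("https://www.example.com") performs a single request and returns True if either the header or a meta tag blocks indexing. The two helper functions can also be used on their own when the headers and HTML have already been fetched.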