Question

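I have the following nginx config that blocks certain bots from the site based on their user agent. It has worked well so far, but I found that the blocked bots also cannot access /robots.txt, so they keep crawling the site and generating hundreds of 403 errors every day.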

map $http_user_agent $block_ua {
    default            0;
    ~*yandexbot        1;
}

server {
    # Block bad bots
    if ($block_ua) {
        return 403;
    }

    location / {
        try_files $uri $uri/ /index.php?$args;
    }

    # other location blocks below...
}

I tried changing the config as shown below to let all bots access /robots.txt, but it doesn't work: testing with curl -I -A "yandexbot" [url] still returns 403 Forbidden.

server {
    location = /robots.txt {
        try_files $uri $uri/ /index.php?$args;
    }

    # Block bad bots
    if ($block_ua) {
        return 403;
    }

    location / {
        try_files $uri $uri/ /index.php?$args;
    }

    # other location blocks below...
}

What should I add to the config to get the desired behavior?

Answer 1

First, try adding allow all; to your Nginx config. The rest of the example below is optional:

location = /robots.txt {
    allow all;
    log_not_found off;
    access_log off;
}
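
One reason allow all may not lift the 403 by itself, sketched here against the question's own config: allow and deny belong to the access module and only match client addresses, while a return inside a server-level if belongs to the rewrite module, which nginx evaluates before it selects a location. The comments below are annotations added for illustration, not part of the original post.

```nginx
server {
    # Rewrite-module directives at server level run first,
    # before nginx picks a location for the request.
    if ($block_ua) {
        return 403;
    }

    location = /robots.txt {
        # Access-module directive: it is checked later, in the access
        # phase, and only concerns client addresses, so it cannot undo
        # the 403 already returned above.
        allow all;
    }
}
```

This is also consistent with the question's observation that moving the location block above the if did not change the result: the order of directives in the file does not change the order of processing phases.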

Then you can try this:

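# Mark the request if the user agent is on the block list
if ($block_ua) {
    set $test A;
}
# Overwrite the mark when the request is for robots.txt
if ($request_uri = /robots.txt) {
    set $test B;
}
# Only requests still marked A are rejected
if ($test = A) {
    return 403;
}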

This is a hacky way to combine multiple if conditions, since nginx does not support nested or compound if statements. See here.
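
A possible alternative to chaining if blocks, offered as a sketch rather than as this answer's own method: fold the robots.txt exception into a second map so that a single if remains. The variable name $deny_request is hypothetical; $block_ua is the variable from the question's map. The sketch relies on map values being allowed to reference other variables and on $uri holding the normalized request path.

```nginx
# Goes in the http block, next to the existing $block_ua map.
map $uri $deny_request {
    default      $block_ua;  # everywhere else, keep the user-agent verdict
    /robots.txt  0;          # never block robots.txt
}

server {
    # A single check replaces the chained if blocks.
    if ($deny_request) {
        return 403;
    }

    # location blocks as before...
}
```

Because if treats an empty string or "0" as false, a request for /robots.txt passes regardless of user agent, while every other path inherits the result of the user-agent map.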

Explanation:

allow all

Allows access for the specified network or address (in this case, all of them).

log_not_found off;

If a client requests the file and it does not exist, do not log the resulting HTTP 404 error.

if ($request_uri != /robots.txt) { set $test A; }

If the requested file is not robots.txt, set $test to "A".

if ($block_ua){ set $test "${test}B"; }

If the user agent matches, append "B" to the current value of $test.

if ($test = AB){ return 403; }

If $test is "AB", both conditions are met, so return 403 and block the bot.
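
Put together, the three conditions explained above might look like the following inside the server block. This is the string-concatenation variant that the explanation describes, which differs slightly from the snippet shown earlier. The initial set $test ""; is an addition, assumed here only to avoid an uninitialized-variable warning in the error log.

```nginx
server {
    # Start from an empty marker (added for this sketch).
    set $test "";

    # Condition 1: the request is for anything other than robots.txt.
    if ($request_uri != /robots.txt) {
        set $test A;
    }

    # Condition 2: the user agent is on the block list.
    if ($block_ua) {
        set $test "${test}B";
    }

    # Both conditions hold only when the marker reads "AB".
    if ($test = AB) {
        return 403;
    }

    # location blocks as before...
}
```

Note that $request_uri includes the query string, so a request such as /robots.txt?x=1 would not count as robots.txt here; comparing against $uri instead is one way to ignore the query string.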

Additional information:

Blocking by the User-Agent header only stops some bots. Anything the client sends can be spoofed, including the User-Agent string.
