Question

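I have a Joomla-based news website with a large number of useless pages showing up in search engine indexes. As a quick fix, at least until I can look at rebuilding the site from scratch, I would like to put a NOINDEX, FOLLOW meta tag on every page except the home page and article pages whose URL ends in .html.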

After working through various code snippets found here and elsewhere, I came up with this:

<?php
if ((JRequest::getVar('view') == "frontpage" ) || ($_SERVER['REQUEST_URI']=='*.html' ))    {
echo "<meta name=\"robots\" content=\"index,follow\"/>\n";
} else {
echo "<meta name=\"robots\" content=\"noindex,follow\"/>\n";
}
?>

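I'm still very new to PHP programming and I'm sure I've made some mistakes, so I was wondering whether some kind soul could give my code a once-over and tell me if it is safe to use before I accidentally break my website.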

Thanks,

Tom

Answer 1

Wouldn't it be better to use a robots.txt file for this?

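Some major crawlers support an Allow directive, which can counteract a following Disallow directive. This is useful when you disallow an entire directory but still want some HTML documents in that directory to be crawled and indexed. While by the standard implementation the first matching robots.txt pattern always wins, Google's implementation differs: an Allow pattern with the same number of characters or more in the directive path wins over a matching Disallow pattern. Bing uses whichever Allow or Disallow directive is the most specific.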

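To be compatible with all robots, if you want to allow single files inside an otherwise disallowed directory, you need to place the Allow directive(s) first, followed by the Disallow, for example: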
Allow: /folder1/myfile.html
Disallow: /folder1/
This example disallows everything in /folder1/ except /folder1/myfile.html, since the latter matches first. For Google, though, the order does not matter.
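To make the two precedence rules concrete, here is a rough PHP sketch that evaluates the same two directives under both strategies. The function `isCrawlAllowed` and its rule format are made up for illustration. It only does plain prefix matching and is not meant to reflect how any particular crawler is actually implemented.

```php
<?php
// Illustrative sketch only: real crawlers also handle wildcards, "$" anchors,
// percent-encoding and per-user-agent groups, none of which are modelled here.

/**
 * Decide whether $path may be crawled, given ordered array(directive, prefix) rules.
 * $strategy is 'first-match' (first applicable rule wins) or
 * 'longest-match' (longest applicable prefix wins, Allow wins a tie).
 */
function isCrawlAllowed(array $rules, $path, $strategy)
{
    $best = null; // the rule that currently decides the outcome
    foreach ($rules as $rule) {
        list($directive, $prefix) = $rule;
        if (strpos($path, $prefix) !== 0) {
            continue; // rule does not apply to this path
        }
        if ($strategy === 'first-match') {
            $best = $rule;
            break;
        }
        if ($best === null
            || strlen($prefix) > strlen($best[1])
            || (strlen($prefix) === strlen($best[1]) && $directive === 'Allow')) {
            $best = $rule;
        }
    }
    // No applicable rule means the path is allowed by default
    return $best === null || $best[0] === 'Allow';
}

$allowFirst = array(
    array('Allow', '/folder1/myfile.html'),
    array('Disallow', '/folder1/'),
);
$disallowFirst = array_reverse($allowFirst);

var_dump(isCrawlAllowed($allowFirst, '/folder1/myfile.html', 'first-match'));      // bool(true)
var_dump(isCrawlAllowed($disallowFirst, '/folder1/myfile.html', 'first-match'));   // bool(false)
var_dump(isCrawlAllowed($disallowFirst, '/folder1/myfile.html', 'longest-match')); // bool(true)
```

The second call shows why the Allow line has to come first for first-match robots, and the third shows why the order stops mattering under the longest-match rule.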

Answer 2

This will never match:

$_SERVER['REQUEST_URI']=='*.html'

== is a literal comparison and does not interpret wildcards. You can check the end of the string with substr:

substr($_SERVER['REQUEST_URI'], -5) == '.html'

Or you can use a regular expression:

//This will match when .html is anywhere inside the string
preg_match('/\.html/', $_SERVER['REQUEST_URI'])

//This will match when .html is at the end of the string, but the
//substr solution is faster in that case
preg_match('/\.html$/', $_SERVER['REQUEST_URI'])
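Putting the substr check back into the snippet from the question might look roughly like the sketch below. The helper names `pathEndsWithHtml` and `robotsMetaTag` are invented for this example. One assumption goes beyond what was asked: REQUEST_URI can carry a query string, so the sketch strips everything from the first "?" before testing the ending.

```php
<?php
/**
 * True when the path part of $requestUri ends in ".html".
 * The query string is dropped first, so "/a.html?page=2" still counts.
 */
function pathEndsWithHtml($requestUri)
{
    $parts = explode('?', $requestUri, 2); // $parts[0] is the path without query string
    return substr($parts[0], -5) === '.html';
}

/**
 * Build the robots meta tag: index the front page and .html articles,
 * noindex everything else.
 */
function robotsMetaTag($view, $requestUri)
{
    $index = ($view === 'frontpage') || pathEndsWithHtml($requestUri);
    return '<meta name="robots" content="' . ($index ? 'index' : 'noindex') . ',follow"/>' . "\n";
}

echo robotsMetaTag('frontpage', '/');                                  // index,follow
echo robotsMetaTag('article', '/index.php/Sport/story.html?start=10'); // index,follow
echo robotsMetaTag('category', '/index.php/Sport');                    // noindex,follow
```

In the template this would be called with the same inputs the question already uses, i.e. `echo robotsMetaTag(JRequest::getVar('view'), $_SERVER['REQUEST_URI']);`.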

Answer 3

Taking the advice from the posters here and from a friend of mine:

You need to go to /public_html/libraries/joomla/document/html and edit html.php

Replace

//set default document metadata
     $this->setMetaData('Content-Type', $this->_mime . '; charset=' . $this->_charset , true );
     $this->setMetaData('robots', 'index, follow' );

with

//set default document metadata
$this->setMetaData('Content-Type', $this->_mime . '; charset=' . $this->_charset , true );

// REQUEST_URI starts with "/" and may carry a query string, so compare the path part only
$path = (string) parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
// Section pages that should stay in the index, besides the home page
$indexable = array('/', '/index.php/National-news', '/index.php/Business', '/index.php/Sport');
if (in_array($path, $indexable, true) || substr($path, -5) == '.html') {
    $this->setMetaData('robots', 'index, follow');
} else {
    $this->setMetaData('robots', 'noindex, follow');
}

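This updates the meta robots tag on every page of the site, clearing the clutter out of the search engines and leaving only what we want to be found in the index.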

I'll try running it on a test server over the next few days and report back.
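Before touching a core file, the rule can be tried in isolation as a dry run. This is only a sketch: the function name `robotsContentFor` and the sample URLs are hypothetical, and it assumes the site sits at the domain root so that every REQUEST_URI begins with "/".

```php
<?php
/**
 * Return the robots meta content for a request URI: "index, follow" for
 * the listed section pages and for paths ending in ".html", otherwise
 * "noindex, follow".
 */
function robotsContentFor($requestUri, array $indexablePaths)
{
    $parts = explode('?', $requestUri, 2); // ignore any query string
    $path = $parts[0];
    $index = in_array($path, $indexablePaths, true) || substr($path, -5) === '.html';
    return $index ? 'index, follow' : 'noindex, follow';
}

// Section paths taken from the answer, written with the leading slash
$indexablePaths = array('/', '/index.php/National-news', '/index.php/Business', '/index.php/Sport');

// Hypothetical sample URLs; swap in real ones from the server's access log
$samples = array(
    '/',
    '/index.php/Sport',
    '/index.php/Sport/some-story.html',
    '/index.php?option=com_search',
    '/index.php/component/mailto/',
);

foreach ($samples as $uri) {
    printf("%-40s %s\n", $uri, robotsContentFor($uri, $indexablePaths));
}
```

The first three sample URLs come out as "index, follow" and the last two as "noindex, follow", which makes it easy to spot a path that was classified differently from what was intended.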

Joomla noindex, follow PHP code
