Question

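I'm running a Rails app on Heroku with a custom domain. Let's call my Heroku app myapp.herokuapp.com and the custom domain www.myapp.com. I accidentally got myapp.herokuapp.com indexed (roughly 700-3000 indexed pages), which has caused duplicate content between the two.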

I recently discovered this and added a 301 in a before_filter in the application controller, like this:

  def forward_from_heroku
    redirect_to "http://www.myapp.com#{request.path}", :status => 301  if request.host.include?('herokuapp')      
  end

This successfully redirects (almost) all traffic from myapp.herokuapp.com to www.myapp.com. I have also requested a change of address for myapp.com in Google Webmaster Tools.

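This approach works well, except (obviously) for files in the public folder. The problem is that robots.txt and sitemap.xml are still served on myapp.herokuapp.com, and the latter in turn points to external sitemaps (on AWS). I can see how Googlebot would read this as there still being content to crawl on myapp.herokuapp.com, even though everything else is 301'd.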

What I want to do is add code to the app so that Google gets one sitemap.xml / robots.txt when it accesses the site via myapp.herokuapp.com, and a different one when it accesses it via www.myapp.com.

How can I code this in config.rb or elsewhere? Basically, I need to bypass the public folder for myapp.herokuapp.com.

Answer 1

You can constrain routes by domain:

scope constraints: {host: /^regex-matching-your-domain/} do

Then, within that scope, return a 404 for robots.txt and sitemap.xml:

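scope constraints: {host: /herokuapp\.com\z/} do
  # Rack endpoint that answers 404. These routes are only reached if the
  # files are not served statically from public/ first.
  not_found = Proc.new { |env|
    [404, {'Content-Type' => 'text/plain'}, ['Not Found']]
  }
  get '/robots.txt' => not_found
  get '/sitemap.xml' => not_found
end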
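
Because files in public/ are usually served by the static file middleware before the router is consulted, the same host check can also be done one level lower, in Rack. The following is a minimal sketch under that assumption, not a tested drop-in: the class name `HerokuCrawlerFileBlocker` and the default host pattern are hypothetical, and the registration line in the trailing comment assumes the app's middleware stack contains `ActionDispatch::Static`.

```ruby
# Hypothetical Rack middleware: answers 404 for crawler files on the Heroku host
# and passes every other request through to the wrapped app.
class HerokuCrawlerFileBlocker
  BLOCKED_PATHS = ['/robots.txt', '/sitemap.xml'].freeze

  def initialize(app, host_pattern = /herokuapp\.com\z/)
    @app = app
    @host_pattern = host_pattern
  end

  def call(env)
    # HTTP_HOST may carry a port ("myapp.herokuapp.com:80"), so strip it first.
    host = (env['HTTP_HOST'] || env['SERVER_NAME']).to_s.sub(/:\d+\z/, '')
    if host =~ @host_pattern && BLOCKED_PATHS.include?(env['PATH_INFO'])
      [404, { 'Content-Type' => 'text/plain' }, ['Not Found']]
    else
      @app.call(env)
    end
  end
end

# Assumed registration in config/application.rb (only applies when Rails
# itself serves static files):
#   config.middleware.insert_before ActionDispatch::Static, HerokuCrawlerFileBlocker
```

With this in place the files could stay in public/, since the request would be answered before the static file middleware sees it.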

Also: you might consider using canonical URLs. That may be a more effective SEO solution, though I'm not sure. https://support.google.com/webmasters/answer/139066?hl=en
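
To illustrate the canonical URL idea, here is a small sketch of a helper that builds a `<link rel="canonical">` tag pointing at the custom domain regardless of which host served the page. The helper name `canonical_link_tag` and the constant `CANONICAL_HOST` are hypothetical; the host value is taken from the question.

```ruby
require 'erb'

# Assumption: the preferred domain is the custom domain from the question.
CANONICAL_HOST = 'www.myapp.com'

# Hypothetical helper: returns a canonical link tag for the given request path.
def canonical_link_tag(path, scheme = 'http')
  url = "#{scheme}://#{CANONICAL_HOST}#{path}"
  # Escape the URL so characters such as & are valid inside the attribute.
  %(<link rel="canonical" href="#{ERB::Util.html_escape(url)}" />)
end

puts canonical_link_tag('/products/1')
# prints: <link rel="canonical" href="http://www.myapp.com/products/1" />
```

In a Rails layout the result would go inside `<head>`, with `request.path` as the argument.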

Answer 2

Here's what I did. It isn't elegant, but it works. I removed sitemap.xml and robots.txt from the public folder and put them in the config folder. Then:

routes.rb

  get '/robots.txt' => 'home#robots'
  get '/sitemap.xml' => 'home#sitemaps'


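home_controller.rb

  def robots
    # Serve the file only on the custom domain; answer 404 on the Heroku host.
    if request.host.eql?('myapp.herokuapp.com')
      head :not_found
    else
      robots = File.read(Rails.root.join("config", "robots.txt"))
      # render :text was removed in Rails 5.1; use render :plain there.
      render :text => robots, :layout => false, :content_type => "text/plain"
    end
  end

  def sitemaps
    if request.host.eql?('myapp.herokuapp.com')
      head :not_found
    else
      sitemaps = File.read(Rails.root.join("config", "sitemap.xml"))
      # On Rails 5.1+ use render :xml (or :body) instead of render :text.
      render :text => sitemaps, :layout => false, :content_type => "text/xml"
    end
  end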
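
If the goal is to serve a different file per host rather than nothing at all on the Heroku host, the host check could be replaced by a lookup. This is only a sketch: the method name `robots_file_for` and both file names are hypothetical, not files from the original app.

```ruby
# Hypothetical mapping from request host to the robots file to serve.
# The file names are assumptions made for illustration.
ROBOTS_FILES = {
  'www.myapp.com'       => 'config/robots.txt',
  'myapp.herokuapp.com' => 'config/robots.heroku.txt'
}.freeze

# Returns the relative path of the robots file for a host,
# or nil when no file is configured for that host.
def robots_file_for(host)
  ROBOTS_FILES[host.to_s.downcase]
end
```

The controller action would then read the returned path relative to `Rails.root`, and fall back to `head :not_found` when the lookup returns nil.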

Ignore public folder files (robots.txt and sitemap.xml) for a specific domain
