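I have a controller that scrapes the entire HTML of a page and stores it in a MySQL database. Before storing the data, I want to encode it with the htmlentities gem. My problem is that it works fine for some websites, e.g. https://www.lookagain.co.uk/, but for others, e.g. https://www.google.co.uk/, I get "invalid byte sequence in UTF-8", and I don't know why. At first I thought the database might be at fault, so I changed all the fields to LONGTEXT, but the problem persists.
Controller: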
require 'nokogiri'
require 'open-uri'
require 'diffy'
require 'htmlentities'

class PageScraperController < ApplicationController
  # Fetches the page at the submitted URL, entity-encodes it and saves it.
  def scrape
    @url = watched_link_params[:url].to_s
    puts "LOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOG#{@url}"
    # open-uri: on Ruby 3.0+ this call must be URI.open(@url)
    @page = Nokogiri::HTML(open(@url))
    coder = HTMLEntities.new
    # This is where "invalid byte sequence in UTF-8" shows up for some sites
    @encodedHTML = coder.encode(@page)
    create
  end

  def index
    @savedHTML = ScrapedPage.all
  end

  def show
    @savedHTML = ScrapedPage.find(params[:id])
  end

  def new
    @savedHTML = ScrapedPage.new
  end

  def create
    # Build the record first so that save is only attempted once
    @savedHTML = ScrapedPage.new(domain: @url, html: @encodedHTML, css: '', javascript: '')
    if @savedHTML.save
      puts "ADDED TO THE DATABASE"
      redirect_to(root_path)
    else
      puts "FAILED TO ADD TO THE DATABASE"
    end
  end

  def edit
  end

  def update
  end

  def delete
    @watched_links = ScrapedPage.find(params[:id])
  end

  def destroy
    @watched_links = ScrapedPage.find(params[:id])
    @watched_links.destroy
    redirect_to(root_path)
  end

  private

  # Strong parameters: only the :url key under :default is allowed
  def watched_link_params
    params.require(:default).permit(:url)
  end
end
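To narrow the error down, here is a minimal sketch of how the message can be reproduced and worked around outside Rails. It assumes (this is not confirmed for either site above) that the failing pages are served in a charset other than UTF-8, so the bytes are not valid UTF-8 by the time they are processed. The helper `to_valid_utf8` and the sample string are hypothetical and only use methods from Ruby's core `String` class.

```ruby
# Hypothetical helper, not part of the controller above.
# Assumption: the raw bytes are labelled UTF-8 but are really in another charset.
def to_valid_utf8(raw, source_encoding = nil)
  str = raw.dup
  if source_encoding
    # Charset is known: transcode to UTF-8, replacing anything unmappable with "?"
    str.force_encoding(source_encoding)
       .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
  else
    # Charset is unknown: treat the bytes as UTF-8 and replace invalid sequences
    str.force_encoding('UTF-8').scrub('?')
  end
end

# "caf\xE9" is "café" in ISO-8859-1; the lone \xE9 byte is invalid in UTF-8.
latin1_bytes = "caf\xE9".b

broken = latin1_bytes.dup.force_encoding('UTF-8')
puts broken.valid_encoding?                     # false

begin
  broken =~ /caf/                               # regexp match on invalid bytes
rescue ArgumentError => e
  puts e.message                                # the same kind of error as in the question
end

puts to_valid_utf8(latin1_bytes, 'ISO-8859-1')  # café
puts to_valid_utf8(latin1_bytes)                # caf?
```

If that assumption holds, passing the fetched body through something like `to_valid_utf8` before calling `coder.encode` might avoid the exception, at the cost of losing any characters that cannot be mapped when the real charset is unknown.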