R: rvest read_html error "Input is not proper UTF-8, indicate encoding"

Date: 2018-06-12 18:59:46

Tags: r web-scraping rvest

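I'm trying to debug a web scraper built with Hadley's rvest package and have run into an encoding problem.

As a reproducible example, consider the following two links: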

library(rvest)

## This works:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4234361")

## This gives me an error:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")

The first link returns:

{xml_document}
<html>
[1] <head>\n<script type="text/javascript">\r\n\r\n\t\r\nif (screen.width <= 480) {\r\n\tdocument.location = "http://www.clasificado ...
[2] <body>\n<br><link href="StylesClas.css" rel="stylesheet" type="text/css">\n<!-- Google Tag Manager --><noscript><iframe src="//w ...

The second link returns:

> read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Input is not proper UTF-8, indicate encoding !
Bytes: 0xDA 0x4C 0x54 0x49 [9]

Inspecting the HTML of BOTH pages, I see the following:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Why does one work but not the other?

I have tried wrapping read_html() with iconv(), as suggested in the following related questions, but that did not work:

  1. R: rvest - is not proper UTF-8, indicate encoding?
  2. encoding error with read_html

Edit:

I am using the following packages:

  • rvest_0.3.2
  • xml2_1.2.0
  • httr_1.3.1

Any ideas? Thanks!

1 Answer:

Answer 0 (score: 2)

Use

read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734", encoding = "ISO-8859-1")

because that is what the document says its encoding is. The problem with putting that information in a meta tag is that R needs to be able to read the file in order to read the tag, but it cannot read the file if it does not already have the correct encoding.
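
The encoding issue can be checked offline, without hitting the site. The sketch below is only an illustration under stated assumptions: `page_raw` is a hypothetical stand-in for the failing page, built from ASCII markup wrapped around the four bytes reported in the error message (0xDA 0x4C 0x54 0x49). It assumes xml2 is installed (read_html() comes from xml2, which rvest loads) and that the page really is ISO-8859-1, as its meta tag claims.

```r
library(xml2)

# The four bytes reported in the error message
bad_bytes <- as.raw(c(0xDA, 0x4C, 0x54, 0x49))

# Read as ISO-8859-1, they decode cleanly to an accented capital U plus "LTI"
decoded <- iconv(rawToChar(bad_bytes), from = "ISO-8859-1", to = "UTF-8")

# Hypothetical stand-in for the page: ASCII markup around those bytes
page_raw <- c(
  charToRaw("<html><body><p>"),
  bad_bytes,
  charToRaw("</p></body></html>")
)

# The byte sequence is not valid UTF-8: 0xDA starts a two-byte sequence,
# but 0x4C is not a valid continuation byte
is_utf8 <- validUTF8(rawToChar(page_raw))

# Passing the encoding explicitly tells the parser how to read the bytes
doc <- read_html(page_raw, encoding = "ISO-8859-1")
text <- xml_text(xml_find_first(doc, "//p"))
```

Here `is_utf8` comes out FALSE, while `decoded` and `text` both hold the same four-character string, which is consistent with the page being ISO-8859-1 rather than UTF-8.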