Question

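I am trying to debug a web scraper that uses Hadley's rvest package and have run into an encoding problem.

As a reproducible example, consider the following two links: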

library(rvest)

## This works:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4234361")

## This gives me an error:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")

The first link returns:

{xml_document}
<html>
[1] <head>\n<script type="text/javascript">\r\n\r\n\t\r\nif (screen.width <= 480) {\r\n\tdocument.location = "http://www.clasificado ...
[2] <body>\n<br><link href="StylesClas.css" rel="stylesheet" type="text/css">\n<!-- Google Tag Manager --><noscript><iframe src="//w ...

The second link returns:

> read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Input is not proper UTF-8, indicate encoding !
Bytes: 0xDA 0x4C 0x54 0x49 [9]

Inspecting the HTML of BOTH pages, I see the following:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Why does one work but not the other?

I have tried wrapping read_html() with iconv(), as suggested in the following related questions, but it did not work:

R: rvest - is not proper UTF-8, indicate encoding?
encoding error with read_html

Edit:

I am using the following packages:

rvest_0.3.2
xml2_1.2.0
httr_1.3.1

Any ideas? Thanks!

Answer 1

Use read_html() with the encoding given explicitly (encoding = "ISO-8859-1"), because that is what the document says. The problem with putting the encoding in a meta tag is that R needs to be able to read the file in order to read that tag, but it cannot read the file if it does not have the correct encoding.
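
As a minimal sketch of the point above, the snippet below builds a small hypothetical page as ISO-8859-1 bytes (a stand-in for the real URL, so no network access is needed) and parses it with the encoding stated explicitly. The page content, including the word "ÚLTIMO", is an assumption made for illustration; only the first four bytes come from the error message in the question.

```r
library(xml2)  # read_html() in rvest is re-exported from xml2

# Hypothetical stand-in for the failing page: Latin-1 bytes that start
# with 0xDA 0x4C 0x54 0x49, the bytes reported in the error message.
page <- c(
  charToRaw("<html><head><title>"),
  as.raw(c(0xDA, 0x4C, 0x54, 0x49, 0x4D, 0x4F)),  # "ÚLTIMO" in ISO-8859-1
  charToRaw("</title></head><body></body></html>")
)

# The same bytes are not valid UTF-8, which is what the parser complains about.
is_utf8 <- validUTF8(rawToChar(page))

# Telling the parser the encoding up front lets it decode the bytes.
doc <- read_html(page, encoding = "ISO-8859-1")
title <- xml_text(xml_find_first(doc, "//title"))
```

For the page in the question, the same encoding argument would presumably be passed along with the URL instead of the raw vector.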
