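I am trying to debug a web scraper that uses Hadley's rvest package and have run into an encoding problem.
As a reproducible example, consider the following two links: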
library(rvest)
## This works:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4234361")
## This gives me an error:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")
Output for the first link:
{xml_document}
<html>
[1] <head>\n<script type="text/javascript">\r\n\r\n\t\r\nif (screen.width <= 480) {\r\n\tdocument.location = "http://www.clasificado ...
[2] <body>\n<br><link href="StylesClas.css" rel="stylesheet" type="text/css">\n<!-- Google Tag Manager --><noscript><iframe src="//w ...
Output for the second link:
> read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xDA 0x4C 0x54 0x49 [9]
Inspecting the HTML of BOTH pages, I see the following:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
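Why does one work but not the other?
I have tried wrapping read_html() in iconv(), as suggested in a related question, but that did not work.
Edit:
I am using the following packages: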
rvest_0.3.2
xml2_1.2.0
httr_1.3.1
Any ideas? Thanks!
Answer 0 (score: 2)
Pass the encoding explicitly to read_html() through its encoding argument, using the charset the page declares (iso-8859-1), because that is what the document says it is. The problem with putting this information in a meta tag is that R needs to be able to read the file in order to read that tag, but it cannot read the file if it does not have the right encoding.
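A minimal sketch of declaring the encoding up front, assuming xml2 (which rvest builds on) is installed. To stay self-contained it parses a small in-memory page instead of fetching the URLs from the question; the page content is hypothetical, built around the bytes from the error message (0xDA 0x4C 0x54 0x49, which spell "ÚLTI" in ISO-8859-1).

```r
library(xml2)

# Hypothetical ISO-8859-1 page held as raw bytes. 0xDA is "Ú" in ISO-8859-1
# but is not a valid standalone byte in UTF-8, which is what the parser
# complained about in the question.
page <- c(
  charToRaw("<html><body><p>"),
  as.raw(c(0xDA, 0x4C, 0x54, 0x49, 0x4D, 0x4F)),  # "ÚLTIMO" in ISO-8859-1
  charToRaw("</p></body></html>")
)

# Tell the parser which encoding the bytes are in, rather than relying on
# it to discover a meta tag inside a document it cannot yet decode.
doc <- read_html(page, encoding = "ISO-8859-1")

# Text extracted from the parsed document comes back as UTF-8.
txt <- xml_text(xml_find_first(doc, "//p"))
print(txt)
```

For the scraper in the question, the same encoding argument would be passed alongside the URL string in the read_html() call.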