rvest in R returns garbled text

Time: 2015-03-27 08:22:30

Tags: r web rvest

I want to scrape text from a page on taobao.com:

shop <- html('http://item.taobao.com/item.htm?spm=a230r.1.14.52.OizVF6&id=42200503654&ns=1&_u=n1b61flaa96&abbucket=7#detail',encoding="utf-8")

shop  %>% 
  html_node(".tb-main-title") %>%
  html_text() %>%
  as.character()

But it doesn't work. The result is:

  \n     HM7000 钃濈墮鑰虫満 涓枃鎶\xa5 绔嬩綋澹\xb0 涓€鎷栦簩 鍚煶涔\x90\n   

P.S. I already tried adding encoding='utf-8' to the html function.

2 answers:

Answer 0: (score: 0)

Check the encoding declared by the target page.

Response headers:

_Host:detail010236101060.unit.cm4
Age:1862
at_autype:5_100262977
at_cat:item_50005050
at_isb:0
at_itemId:42200503654
at_nick:guoy087
Cache-Control:max-age=3
Connection:keep-alive
Content-Encoding:gzip
Content-Language:zh-CN
Content-Type:text/html;charset=GBK <------ Encoding is GBK
Date:Fri, 27 Mar 2015 08:43:16 GMT
S:STATUS_NORMAL
Server:Tengine
Transfer-Encoding:chunked
Vary:Accept-Encoding
Via:wagbridge010238184034.cm4[0,200-0,H]
X-Cache:HIT TCP_MEM_HIT dirn:-2:-2
X-Category:/cat/50008090

As the Content-Type header shows, the page is encoded as GBK, not UTF-8 (as described here).
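A minimal sketch of acting on that header, assuming a newer rvest release in which `html()` has been superseded by `read_html()`. The helper name `read_title` is hypothetical, and the sample string is only an illustration. The base R lines run offline and show that GBK bytes decode cleanly once the right source encoding is named.

```r
# Hypothetical helper: parse a page using the charset its server declares.
# Assumes the rvest package is installed; it is only needed when the function is called.
read_title <- function(url, charset = "GBK") {
  page <- rvest::read_html(url, encoding = charset)
  node <- rvest::html_node(page, ".tb-main-title")
  trimws(rvest::html_text(node))
}

# Offline illustration with base R: build GBK bytes, then decode them as GBK.
title_utf8 <- "\u84dd\u7259\u8033\u673a"  # sample Chinese text (an assumption)
gbk_bytes  <- iconv(title_utf8, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
decoded    <- iconv(list(gbk_bytes), from = "GBK", to = "UTF-8")
```

Calling `read_title()` with the URL from the question would fetch the live page, so it is left uncalled here.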

Answer 1: (score: 0)

html('http://item.taobao.com/item.htm?spm=a230r.1.14.52.OizVF6&id=42200503654&ns=1&_u=n1b61flaa96&abbucket=7#detail') %>% html_node(".tb-main-title") %>% html_text(encoding='utf-8')
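One possible reason that labelling the text as UTF-8 helps: the garbled output in the question appears consistent with UTF-8 bytes being displayed as if they were GBK. The base R sketch below reproduces that mistake and reverses it. The sample string is an assumption, and the behaviour depends on the platform's iconv supporting GBK.

```r
original <- "\u84dd\u7259\u8033\u673a"  # sample Chinese text (an assumption)

# Misread the UTF-8 bytes as GBK, which is roughly what a GBK console does.
utf8_bytes <- charToRaw(enc2utf8(original))
garbled    <- iconv(list(utf8_bytes), from = "GBK", to = "UTF-8")

# Reverse the mistake: turn the garbled text back into its bytes,
# then label those bytes as UTF-8 so R interprets them correctly.
recovered_bytes <- iconv(garbled, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
recovered <- rawToChar(recovered_bytes)
Encoding(recovered) <- "UTF-8"
```

`Encoding<-` only changes the label attached to the string; the bytes themselves are untouched, which is why it works only when the bytes really are UTF-8.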