I want to scrape text from taobao.com:
shop <- html('http://item.taobao.com/item.htm?spm=a230r.1.14.52.OizVF6&id=42200503654&ns=1&_u=n1b61flaa96&abbucket=7#detail',encoding="utf-8")
shop %>%
html_node(".tb-main-title") %>%
html_text() %>%
as.character()
But it doesn't work. The result is:
\n HM7000 钃濈墮鑰虫満 涓枃鎶\xa5 绔嬩綋澹\xb0 涓€鎷栦簩 鍚煶涔\x90\n
PS: I tried adding encoding='utf-8' to the html function.
Answer 0: (score: 0)
Check the encoding of the target page.
Response headers:
_Host:detail010236101060.unit.cm4
Age:1862
at_autype:5_100262977
at_cat:item_50005050
at_isb:0
at_itemId:42200503654
at_nick:guoy087
Cache-Control:max-age=3
Connection:keep-alive
Content-Encoding:gzip
Content-Language:zh-CN
Content-Type:text/html;charset=GBK <------ Encoding is GBK
Date:Fri, 27 Mar 2015 08:43:16 GMT
S:STATUS_NORMAL
Server:Tengine
Transfer-Encoding:chunked
Vary:Accept-Encoding
Via:wagbridge010238184034.cm4[0,200-0,H]
X-Cache:HIT TCP_MEM_HIT dirn:-2:-2
X-Category:/cat/50008090
As you can see, the page encoding is not UTF-8 but GBK (as described here).
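One way to act on this, sketched under assumptions: declare the page's charset when parsing. The snippet assumes a current rvest/xml2, where `read_html()` is the successor to `html()`. To stay runnable offline, it parses a tiny hand-made GBK-encoded document (hypothetical, standing in for the Taobao page) instead of fetching the URL.

```r
library(rvest)  # re-exports read_html() from xml2 and the %>% pipe

# Hypothetical stand-in for the product page: a small document whose bytes are GBK.
# The title is written with \u escapes so the script does not depend on file encoding.
page_utf8 <- "<html><body><h3 class=\"tb-main-title\">HM7000 \u84dd\u7259\u8033\u673a</h3></body></html>"
page_gbk <- iconv(page_utf8, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

# Declare the charset at parse time so the bytes are decoded as GBK.
title <- read_html(page_gbk, encoding = "GBK") %>%
  html_node(".tb-main-title") %>%
  html_text()

print(title)
```

For the live page, the same call would take the URL in place of the raw vector, for example `read_html(url, encoding = "GBK")`. Whether that alone fixes the printed output may also depend on the console locale.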
Answer 1: (score: 0)
html('http://item.taobao.com/item.htm?spm=a230r.1.14.52.OizVF6&id=42200503654&ns=1&_u=n1b61flaa96&abbucket=7#detail') %>% html_node(".tb-main-title") %>% html_text(encoding='utf-8')
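A possible reason that marking the text as UTF-8 helps, offered as a guess rather than a diagnosis: the garbled output in the question looks like UTF-8 bytes displayed as if they were GBK. The base R sketch below reproduces that effect with `iconv()`. The sample string is an assumption chosen for illustration.

```r
# Sample text (assumed for illustration), written with \u escapes so the
# script itself is encoding-independent.
original <- "\u84dd\u7259\u8033\u673a"

# The UTF-8 bytes of the sample text.
utf8_bytes <- charToRaw(enc2utf8(original))

# Misreading those bytes as GBK yields garbled characters.
garbled <- iconv(list(utf8_bytes), from = "GBK", to = "UTF-8")

# Reading the same bytes as UTF-8 recovers the text.
recovered <- iconv(list(utf8_bytes), from = "UTF-8", to = "UTF-8")

print(garbled)
print(recovered)
```

If this reading is right, the bytes themselves are fine and only the declared encoding of the string is wrong, which is what the `encoding` argument in the call above addresses.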