Question

Why do I get garbled characters when parsing a web page?

I used encoding="big-5\\IGNORE" to get readable characters, but it does not work.

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)

How should I change the code so that the garbled characters display correctly?

@MartinMorgan (below) suggested using

htmlParse(url,isURL=TRUE,encoding="big-5")

Here is an example of what happens:

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock

The total number of records should be 1335. In the case above it is 309, so many records seem to have been lost.

This is a complicated problem, with several issues:

Malformed HTML file

The page is not a standard web page and is not a well-formed HTML file. To see this, run:

url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)

Then try opening the downloaded file stockbig-5 in Firefox.

The iconv function in R fails

If the HTML file is well-formed, you can use

data = readLines(file)
datachange = iconv(data, from = "source encoding", to = "target encoding//IGNORE")
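As a minimal sketch of the well-formed case, the following builds a small, valid Big5 file locally (so it runs without network access) and converts it with readLines and iconv. The temporary file and its contents are assumptions made for illustration only. It also assumes options(encoding=...) is left at its default, so readLines does not re-encode the input on its own.

```r
# Build a small, valid Big5 file so the example runs offline.
# The bytes A4 A4 A4 E5 are the Big5 encoding of the two characters U+4E2D U+6587.
path <- tempfile(fileext = ".htm")
writeBin(
  c(charToRaw("<p>"), as.raw(c(0xA4, 0xA4, 0xA4, 0xE5)), charToRaw("</p>\n")),
  path
)

# Read the lines unchanged, then convert from Big5 to UTF-8.
lines <- readLines(path)
converted <- iconv(lines, from = "BIG5", to = "UTF-8")
print(converted)
```

If any element of the result is NA, iconv could not convert that line, which is a hint that the input is not clean Big5.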

When the HTML file is not well-formed, as in this example, try the same approach by running:

data=readLines("stockbig-5")

R reports the following:

1: In readLines("stockbig-5") :  
  invalid input found on input connection 'stockbig-5'

So you cannot use the iconv function in R to change the encoding of a malformed HTML file.

But you can do it in the shell.

Answer 1

It took me a whole night to solve this, and it was hard going. System: Debian 6 (locale UTF-8) + R 2.15 (locale UTF-8) + GNOME Terminal (locale UTF-8).
Here is the code:

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)
system('iconv -f big-5  -t  UTF-8//IGNORE    stockbig-5  > stockutf-8')
data=htmlParse("stockutf-8",isURL=FALSE,encoding="utf-8\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock

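I would like my code to be more elegant. A shell command inside R code is awkward:

system('iconv -f big5 -t UTF-8//IGNORE stockgb2312 > stockutf-8')

I tried to replace it with pure R code and failed. How can it be replaced in pure R? You can use the code above to reproduce the result on your own computer. It is half done and half successful, and I will keep trying.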
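One pure-R direction that may be worth trying is to read the file as raw bytes, which avoids readLines and its connection-level decoding, and then call iconv with its documented sub argument, so that bytes which cannot be converted are dropped instead of stopping the conversion. This is only a sketch: the helper name read_big5_as_utf8 is hypothetical, the demo input is made up, and it has not been checked against the HKEX page itself.

```r
# Hypothetical helper: read a file as raw bytes and convert Big5 -> UTF-8,
# dropping any bytes that iconv cannot convert (sub = "").
read_big5_as_utf8 <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  iconv(rawToChar(bytes), from = "BIG5", to = "UTF-8", sub = "")
}

# Self-contained demo: valid Big5 for U+4E2D U+6587 with one stray byte (0xFF)
# inserted between the two characters.
demo <- tempfile()
writeBin(
  c(charToRaw("<td>"), as.raw(c(0xA4, 0xA4, 0xFF, 0xA4, 0xE5)), charToRaw("</td>")),
  demo
)
txt <- read_big5_as_utf8(demo)
print(txt)

# With the file downloaded in the answer, the converted text could then be
# parsed directly (assumes the XML package is installed):
# library(XML)
# doc <- htmlParse(read_big5_as_utf8("stockbig-5"), asText = TRUE, encoding = "UTF-8")
```

Dropping bytes with sub = "" plays a similar role to //IGNORE in the shell command, but it silently discards data, so it is worth comparing the resulting row count against the expected 1335.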
