Question

I'm new to R, and I've run into a problem while using R to fetch a fund's daily net asset value from a website's API.

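When I run htmlTreeParse, I get the error "XML content does not seem to be XML". I searched for this problem, but the answers I found (for example, using http instead of https) don't apply to my case.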

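If you know how to solve this, I'd appreciate your help. The response contains some Chinese characters, so they may not display correctly for you.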

library(RCurl)
library(XML)

myHttpheader <- c(
"User-Agent"="Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ","Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language"="en-us","Connection"="keep-alive","Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7"
)

url <- "http://fund.eastmoney.com/f10/F10DataApi.aspx?type=lsjz&code=160212&page=1&per=2&sdate=&edate="

webpage <- getURL(url,httpheader=myHttpheader)

pagetree <- htmlTreeParse(webpage,encoding="GB2312", error=function(...){}, useInternalNodes = TRUE,trim=TRUE)

net_value<-getNodeSet(pagetree, '/table/tbody/tr[1]/td[3]')

This is the message I get when running htmlTreeParse:

> pagetree <- htmlTreeParse(webpage,encoding="GB2312", error=function(...){}, useInternalNodes = TRUE,trim=TRUE)
Warning message:
XML content does not seem to be XML: 'var apidata={ content:"<table class='w782 comm lsjz'><thead><tr><th class='first'>净值日期</th><th>单位净值</th><th>累计净值</th><th>日增长率</th><th>申购状态</th><th>赎回状态</th><th class='tor last'>分红送配</th></tr></thead><tbody><tr><td>2018-07-06</td><td class='tor bold'>1.0620</td><td class='tor bold'>1.1140</td><td class='tor bold red'>1.92%</td><td>开放申购</td><td>开放赎回</td><td class='red unbold'></td></tr><tr><td>2018-07-05</td><td class='tor bold'>1.0420</td><td class='tor bold'>1.0940</td><td class='tor bold red'>0.39%</td><td>开放申购</td><td>开放赎回</td><td class='red unbold'></td></tr></tbody></table>",records:715,pages:358,curpage:1};'

Answer 1

This is a problem with how R and the XML library handle encodings.

Fetching the page works:

library(RCurl)
library(XML)

myHttpheader <- c(
  "User-Agent"="Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ","Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language"="en-us","Connection"="keep-alive","Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7"
)
url <- "http://fund.eastmoney.com/f10/F10DataApi.aspx?type=lsjz&code=160212&page=1&per=2&sdate=&edate="
webpage <- getURL(url, httpheader = myHttpheader)
webpage
[1] "var apidata={ content:\"<table class='w782 comm lsjz'><thead><tr><th class='first'>净值日期</th><th>单位净值</th><th>累计净值</th><th>日增长率</th><th>申购状态</th><th>赎回状态</th><th class='tor last'>分红送配</th></tr></thead><tbody><tr><td>2018-07-06</td><td class='tor bold'>2.1640</td><td class='tor bold'>2.1640</td><td class='tor bold grn'>-0.05%</td><td>开放申购</td><td>开放赎回</td><td class='red unbold'></td></tr><tr><td>2018-07-05</td><td class='tor bold'>2.1650</td><td class='tor bold'>2.1650</td><td class='tor bold grn'>-1.81%</td><td>开放申购</td><td>开放赎回</td><td class='red unbold'></td></tr></tbody></table>\",records:2034,pages:1017,curpage:1};"

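The problem appears at the parsing step. When you parse under your local locale settings with encoding = "GB2312", the nodes that contain Chinese characters are dropped: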

Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
pagetree <- htmlTreeParse(webpage, encoding = "GB2312", 
                          error = function(...){}, useInternalNodes = TRUE, trim = TRUE)
Warning message:
XML content does not seem to be XML: 'var apidata={ content:"<table class='w782 comm lsjz'><thead><tr><th class='first'><U+51C0><U+503C><U+65E5><U+671F></th><th><U+5355><U+4F4D><U+51C0><U+503C></th><th><U+7D2F><U+8BA1><U+51C0><U+503C></th><th><U+65E5><U+589E><U+957F><U+7387></th><th><U+7533><U+8D2D><U+72B6><U+6001></th><th><U+8D4E><U+56DE><U+72B6><U+6001></th><th class='tor last'><U+5206><U+7EA2><U+9001><U+914D></th></tr></thead><tbody><tr><td>2018-07-06</td><td class='tor bold'>2.1640</td><td class='tor bold'>2.1640</td><td class='tor bold grn'>-0.05%</td><td><U+5F00><U+653E><U+7533><U+8D2D></td><td><U+5F00><U+653E><U+8D4E><U+56DE></td><td class='red unbold'></td></tr><tr><td>2018-07-05</td><td class='tor bold'>2.1650</td><td class='tor bold'>2.1650</td><td class='tor bold grn'>-1.81%</td><td><U+5F00><U+653E><U+7533><U+8D2D></td><td><U+5F00><U+653E><U+8D4E><U+56DE></td><td class='red unbold'></td></tr></tbody></table>",records:2034,pages:1017,curpage:1};' 
pagetree
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>var apidata={ content:"</p>
<table class="w782 comm lsjz"><thead><tr><th class="first"></th></tr></thead></table>
</body></html>
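
One possible way to sidestep the locale dependence, sketched under assumptions: the response is a JavaScript assignment rather than an HTML document, so you could cut out the table markup first and hand the parser only that fragment, with an explicit encoding. The sample string below is a shortened stand-in for the real response, the helper name extract_table_html is hypothetical, and treating the text as UTF-8 is an assumption you should check against what getURL actually returns on your machine.

```r
library(XML)

# Shortened stand-in for the API response (assumption: same overall shape).
# The header text is written with \u escapes so this script stays ASCII-only.
sample_response <- paste0(
  "var apidata={ content:\"<table class='w782 comm lsjz'>",
  "<thead><tr><th class='first'>\u51c0\u503c\u65e5\u671f</th></tr></thead>",
  "<tbody><tr><td>2018-07-06</td><td class='tor bold'>2.1640</td>",
  "<td class='tor bold'>2.1640</td></tr></tbody></table>\",",
  "records:2034,pages:1017,curpage:1};"
)

# Hypothetical helper: keep only the text from "<table" through "</table>".
extract_table_html <- function(js_text) {
  m <- regexpr("<table.*</table>", js_text)
  if (m == -1) stop("no <table> fragment found in the response")
  regmatches(js_text, m)
}

table_html <- extract_table_html(sample_response)

# Parse the fragment as text, stating the encoding explicitly instead of
# relying on the session locale.
doc <- htmlParse(table_html, asText = TRUE, encoding = "UTF-8")

# Third cell of the first body row, converted to a number.
net_value <- as.numeric(xpathSApply(doc, "//table/tbody/tr[1]/td[3]", xmlValue))
print(net_value)
```

Because the parser only ever sees the table fragment, the leading "var apidata=" text no longer ends up in the tree.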

Here is a solution: switch the locale to Chinese and parse without the encoding argument.

Sys.setlocale(category="LC_ALL", locale = "chinese")
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
pagetree <- htmlTreeParse(webpage, error = function(...){}, useInternalNodes = TRUE, trim = TRUE)
Warning message:
XML content does not seem to be XML: 'var apidata={ content:"<table class='w782 comm lsjz'><thead><tr><th class='first'>净值日期</th><th>单位净值</th><th>累计净值</th><th>日增长率</th><th>申购状态</th><th>赎回状态</th><th class='tor last'>分红送配</th></tr></thead><tbody><tr><td>2018-07-06</td><td class='tor bold'>2.1640</td><td class='tor bold'>2.1640</td><td class='tor bold grn'>-0.05%</td><td>开放申购</td><td>开放赎回</td><td class='red unbold'></td></tr><tr><td>2018-07-05</td><td class='tor bold'>2.1650</td><td class='tor bold'>2.1650</td><td class='tor bold grn'>-1.81%</td><td>开放申购</td><td>开放赎回</td><td class='red unbold'></td></tr></tbody></table>",records:2034,pages:1017,curpage:1};' 
pagetree
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>var apidata={ content:"</p>
<table class="w782 comm lsjz">
<thead><tr>
<th class="first">氓聡聙氓聙录忙聴楼忙聹聼</th>
<th>氓聧聲盲陆聧氓聡聙氓聙录</th>
<th>莽麓炉猫庐隆氓聡聙氓聙录</th>
<th>忙聴楼氓垄聻茅聲驴莽聨聡</th>
<th>莽聰鲁猫麓颅莽聤露忙聙聛</th>
<th>猫碌聨氓聸聻莽聤露忙聙聛</th>
<th class="tor last">氓聢聠莽潞垄茅聙聛茅聟聧</th>
</tr></thead>
<tbody>
<tr>
<td>2018-07-06</td>
<td class="tor bold">2.1640</td>
<td class="tor bold">2.1640</td>
<td class="tor bold grn">-0.05%</td>
<td>氓录聙忙聰戮莽聰鲁猫麓颅</td>
<td>氓录聙忙聰戮猫碌聨氓聸聻</td>
<td class="red unbold"></td>
</tr>
<tr>
<td>2018-07-05</td>
<td class="tor bold">2.1650</td>
<td class="tor bold">2.1650</td>
<td class="tor bold grn">-1.81%</td>
<td>氓录聙忙聰戮莽聰鲁猫麓颅</td>
<td>氓录聙忙聰戮猫碌聨氓聸聻</td>
<td class="red unbold"></td>
</tr>
</tbody>
</table>",records:2034,pages:1017,curpage:1};</body></html>

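After that, you only need one more slash at the start of the XPath expression (//table instead of /table) to get the value you want.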

getNodeSet(pagetree, '//table/tbody/tr[1]/td[3]')
[[1]]
<td class="tor bold">2.1640</td> 

attr(,"class")
[1] "XMLNodeSet"
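
To see why the extra slash matters, here is a small self-contained sketch on a made-up HTML string (not the real response). The HTML parser wraps the content in html and body elements, so the table is never the document root. A path that starts with /table looks for a root element named table and matches nothing, while //table searches at any depth.

```r
library(XML)

# Made-up fragment for illustration; the parser adds <html><body> around it.
html <- "<table><tbody><tr><td>2018-07-06</td><td>2.1640</td><td>2.1640</td></tr></tbody></table>"
doc <- htmlParse(html, asText = TRUE)

# Absolute path: the root element is <html>, so nothing matches.
absolute_hits <- getNodeSet(doc, "/table/tbody/tr[1]/td[3]")
length(absolute_hits)

# Descendant path: finds the table wherever it sits in the tree.
descendant_hits <- getNodeSet(doc, "//table/tbody/tr[1]/td[3]")
length(descendant_hits)

# xmlValue() returns the cell text, which can then be converted to a number.
as.numeric(xmlValue(descendant_hits[[1]]))
```

The same descendant path works on the pagetree object above, since the table there also sits inside html and body.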
