Web scraping attributes that contain a substring in R

Date: 2016-08-08 21:14:18

Tags: r xpath rvest xml2

I am using the xml2 package in R to scrape data from a web page. The text I want to scrape is enclosed in the following tags:

<td>
<a href="javascript:WebForm_DoPostBackWithOptions(new 
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl01&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Species A    
</a></td>
<td>
<a href="javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl02&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Species B   </a></td>
<td><a href="javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl03&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Sepcies C    </a></td>
<td>
<a href="javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl04&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Species D</a></td>
<td>
<a href="javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl05&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Species E    </a></td>

I tried the following lines of code in R:

library(xml2)
library(rvest)  # html_nodes() is provided by rvest
page = read_html(website)
nodes = html_nodes(page, xpath='//td/a[@href*="javascript"]')

With the code above I want to extract every node whose href attribute contains the substring "javascript", but I get the following error message:

xmlXPathEval: evaluation failed
Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Invalid expression [1207]

I would appreciate any suggestions.

Thanks for your time.

Cheers.

1 Answer:

Answer 0: (score: 2)

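The `*=` operator is CSS selector syntax, not XPath, which is why the expression is rejected as invalid. In XPath you can use contains() to find the anchor tags whose href contains the text you are interested in: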

library(xml2)
library(rvest)
website <- '<td>
<a href="javascript:WebForm_DoPostBackWithOptions(new 
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl01&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Species A    
</a></td>
<td>
<a href="javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl02&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Species B   </a></td>
<td><a href="javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl03&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Sepcies C    </a></td>
<td>
<a href="javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl04&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Species D</a></td>
<td>
<a href="javascript:WebForm_DoPostBackWithOptions(new
WebForm_PostBackOptions(&quot;ctl00$CenterContent$ctl05&quot;,
&quot;&quot;, true, &quot;&quot;, &quot;&quot;, false,
true))">Species E    </a></td>'
page <- read_html(website)
nodes <- html_nodes(page, xpath='//td/a[contains(@href,"javascript")]')

> nodes
{xml_nodeset (5)}
[1] <a href="javascript:WebForm_DoPostBackWithOptions(new &#10;WebForm_PostBackOptions(&quot;ctl00$CenterConte ...
[2] <a href="javascript:WebForm_DoPostBackWithOptions(new&#10;WebForm_PostBackOptions(&quot;ctl00$CenterConten ...
[3] <a href="javascript:WebForm_DoPostBackWithOptions(new&#10;WebForm_PostBackOptions(&quot;ctl00$CenterConten ...
[4] <a href="javascript:WebForm_DoPostBackWithOptions(new&#10;WebForm_PostBackOptions(&quot;ctl00$CenterConten ...
[5] <a href="javascript:WebForm_DoPostBackWithOptions(new&#10;WebForm_PostBackOptions(&quot;ctl00$CenterConten ...
>
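
Since the goal is the link text rather than the nodes themselves, here is a minimal sketch of the remaining step. It assumes a simplified stand-in page (the `doPostBack(...)` hrefs and the `/about.html` link are made up for illustration), and it also shows that the original `*=` syntax does work if the selector is passed as CSS instead of XPath:

```r
library(xml2)
library(rvest)

# Simplified, hypothetical stand-in for the page in the question.
html <- '<table><tr>
  <td><a href="javascript:doPostBack(1)">Species A    
  </a></td>
  <td><a href="javascript:doPostBack(2)">Species B   </a></td>
  <td><a href="/about.html">About</a></td>
</tr></table>'
page <- read_html(html)

# XPath 1.0: substring match on the attribute value with contains().
by_xpath <- html_nodes(page, xpath = '//td/a[contains(@href, "javascript")]')

# CSS: the *= attribute operator is valid when passed via `css`.
by_css <- html_nodes(page, css = 'td > a[href*="javascript"]')

# Extract the link text and strip the padding whitespace and newlines.
species <- trimws(html_text(by_xpath))
species
```

Both selectors should pick up the same two anchors here and skip the plain `/about.html` link. `trimws()` is used because the link text in the question is padded with trailing spaces and line breaks.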