Question

I got this xml_nodeset (I don't even know whether that is the right term) from the economic calendar table on forexfactory.com:

<td class="calendar__cell calendar__previous previous">44.7</td>
<td class="calendar__cell calendar__previous previous"><span class="revised worse" title="Revised From -0.6%">-1.1%<span class="icon icon--revised"></span></span></td>

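In the first case I would like to get an empty string or NA; in the second case, the value "Revised From -0.6%".

Basically, I want an additional column that is empty when there is no revised value and holds the "Revised From" text when there is one.

I tried

%>% html_attr(x, "title") and %>% html_attrs(x), inspired by this question here, but without success.

Of course, x holds the xml_node.

Sorry if this is a newbie question!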

Answer 1

It may not be the best solution, but it works for your code example:

library(rvest)
tmp <- read_html('<td class="calendar__cell calendar__previous previous">44.7</td>
             <td class="calendar__cell calendar__previous previous"><span class="revised worse" title="Revised From -0.6%">-1.1%<span class="icon icon--revised"></span></span></td>')
tmp2 <- tmp %>% 
  html_nodes("td")
# For each td, take the title of its first child element, or NA if it has no children
tmp3 <- lapply(tmp2, function(x) {
  titles <- html_attr(html_children(x), "title")
  if (length(titles) > 0) titles[1] else NA_character_
})
unlist(tmp3)

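By the way, don't use %>% html_attr(x, "title"); use %>% html_attr("title"). The pipe already passes its left-hand side as the first argument, so supplying x again shifts "title" into the wrong position.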
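To make the point about the pipe concrete, here is a minimal sketch. The one-cell document and the variable names (doc, span, piped, direct) are made up for illustration, and the cell is wrapped in a table only so that the fragment parses predictably.

```r
library(rvest)

# Hypothetical one-cell document modelled on the question's second example
doc <- read_html('<table><tr><td><span class="revised worse" title="Revised From -0.6%">-1.1%</span></td></tr></table>')
span <- html_node(doc, "span.revised")

# The pipe supplies `span` as the first argument, so these two calls are equivalent
piped  <- span %>% html_attr("title")
direct <- html_attr(span, "title")

identical(piped, direct)
```

Writing span %>% html_attr(span, "title") would instead pass span twice, which is why the call in the question did not behave as intended.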

Answer 2

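Here is another possible solution. It works in two steps: find the td nodes, then find the span nodes with the class "revised" inside them. When no matching node is found, html_node() returns a missing node, for which html_attr() gives NA, so the number of outputs equals the number of inputs.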

library(rvest)
page <- read_html('<td class="calendar__cell calendar__previous previous">44.7</td>
                 <td class="calendar__cell calendar__previous previous">
                 <span class="revised worse" title="Revised From -0.6%">-1.1%
                 <span class="icon icon--revised"></span></span></td>')

#find the td nodes
tdnode <- page %>% html_nodes("td")

#find span nodes within 'td' nodes with the class 'revised'
#Extract the attribute associated with 'title'
tdnode %>% html_node("span.revised") %>% html_attr("title")
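Since the question asks for an additional column, the same idea could be sketched as a data frame, assuming the cells sit in a table row as they would on the real page. The column names previous and revised_from and the variable names cells and result are hypothetical; html_text(trim = TRUE) is used only to strip surrounding whitespace from the cell text.

```r
library(rvest)

# The question's two cells, wrapped in a table row (an assumption about the page)
page <- read_html('<table><tr>
  <td class="calendar__cell calendar__previous previous">44.7</td>
  <td class="calendar__cell calendar__previous previous"><span class="revised worse" title="Revised From -0.6%">-1.1%<span class="icon icon--revised"></span></span></td>
</tr></table>')

cells <- page %>% html_nodes("td.calendar__previous")

# One row per td: the displayed value plus the title of its span.revised, or NA
result <- data.frame(
  previous     = cells %>% html_text(trim = TRUE),
  revised_from = cells %>% html_node("span.revised") %>% html_attr("title"),
  stringsAsFactors = FALSE
)

result
```

Because html_node() yields exactly one result per input node, the two columns always have the same length, even when some cells have no revision.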
