Question

My problem

I am trying to scrape documents from this URL:

url <- "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=01.01.2017&to_date=05.01.2017&x=0&y=0"

The HTML for a single document of interest looks like this:

<span class="rank_title">
                  <a href="https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&amp;type=highlight_simple_query&amp;page=1&amp;from_date=01.01.2017&amp;to_date=05.01.2017&amp;sort=relevance&amp;insertion_date=&amp;top_subcollection_aza=all&amp;query_words=&amp;rank=5&amp;azaclir=aza&amp;highlight_docid=aza%3A%2F%2F05-01-2017-2C_826-2015&amp;number_of_ranks=67" title="Seite mit hervorgehobenen Suchbegriffen öffnen">05.01.2017 2C 826/2015</a>
</span>
   <span class="published_info small normal">
      <a href="https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&amp;type=highlight_simple_query&amp;page=1&amp;from_date=01.01.2017&amp;to_date=05.01.2017&amp;sort=relevance&amp;insertion_date=&amp;top_subcollection_aza=all&amp;query_words=&amp;highlight_docid=atf%3A%2F%2F143-I-73%3Ade&amp;azaclir=aza">publiziert</a>
   </span>
<div class="rank_data">
      <div class="court small normal">
      IIe Cour de droit public
   </div>

      <div class="subject small normal">
      Finances publiques &amp; droit fiscal
   </div>

      <div class="object small normal">
      Impôts communal et cantonal 2009, impôt sur la fortune; estimation de titres non cotés, garantie de la propriété
   </div>
   </div>               </li>

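I am interested in the classes "rank_title", "published_info small normal", "subject small normal" and "object small normal". I would like to store this information in a data frame.

However, not all documents contain all classes (for example, on this page only one document has the "published_info small normal" class).

If "published_info small normal" is available, I am mainly interested in extracting the title of that document, in this example:

143 I 73

Edit: it would also be fine if the script only extracted "publiziert" whenever "published_info small normal" is available.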

My approach

I found a post that is very helpful for my problem: Scraping with rvest - complete with NAs when tag is not present

I started implementing it:

library(XML)
doc <- xmlTreeParse(url, asText = TRUE, useInternalNodes = TRUE)

However, I do not know how to implement the code for nodes that are only sometimes present.

Answer 1

Found a solution:

library(rvest)  # read_html(), html_nodes(), html_attr(), html_text(), %>%
library(purrr)  # map_df() (needs dplyr installed to bind the rows)

# read the html
url <- "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=01.01.2017&to_date=05.01.2017&x=0&y=0"
pg <- read_html(url)

xdf <- pg %>%
  html_nodes("div.ranklist_content ol li") %>%   # select enclosing nodes
  # iterate over each, pulling out desired parts and coerce to data.frame
  map_df(~list(
    link = html_nodes(.x, ".rank_title a") %>%
      html_attr("href") %>%
      {if (length(.) == 0) NA else .},   # replace length-0 elements with NA
    title = html_nodes(.x, ".rank_title a") %>%
      html_text() %>%
      {if (length(.) == 0) NA else .},
    publication_link = html_nodes(.x, ".published_info a") %>%
      html_attr("href") %>%
      {if (length(.) == 0) NA else .},
    publication = html_nodes(.x, ".published_info a") %>%
      html_text() %>%
      {if (length(.) == 0) NA else .},
    court = html_nodes(.x, ".rank_data .court") %>%
      html_text(trim = TRUE) %>%
      {if (length(.) == 0) NA else .},
    subject = html_nodes(.x, ".rank_data .subject") %>%
      html_text(trim = TRUE) %>%
      {if (length(.) == 0) NA else .},
    object = html_nodes(.x, ".rank_data .object") %>%
      html_text(trim = TRUE) %>%
      {if (length(.) == 0) NA else .}
  ))
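
A shorter variant may also work, relying on html_node() (singular), which returns exactly one result per parent node and a missing node where nothing matches. The sketch below is not run against the live site: it uses a small inline HTML string as a stand-in for the result list, and the second entry is an invented placeholder that simply lacks the published_info span. The names html, items and xdf2 are made up for this illustration.

```r
library(rvest)

# Inline stand-in for the result list (assumption: same class names as the real page).
# Only the first <li> has a .published_info span; the second is a placeholder entry.
html <- '
<div class="ranklist_content"><ol>
  <li>
    <span class="rank_title"><a href="#a">05.01.2017 2C 826/2015</a></span>
    <span class="published_info small normal"><a href="#p">publiziert</a></span>
    <div class="rank_data"><div class="court small normal"> IIe Cour de droit public </div></div>
  </li>
  <li>
    <span class="rank_title"><a href="#b">placeholder entry</a></span>
    <div class="rank_data"><div class="court small normal"> placeholder court </div></div>
  </li>
</ol></div>'

items <- read_html(html) %>% html_nodes("div.ranklist_content ol li")

# html_node() yields one (possibly missing) node per <li>, so every column has
# the same length and absent tags come out as NA without an explicit length check.
xdf2 <- data.frame(
  title       = items %>% html_node(".rank_title a") %>% html_text(),
  publication = items %>% html_node(".published_info a") %>% html_text(),
  court       = items %>% html_node(".rank_data .court") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

print(xdf2)
```

In rvest 1.0 and later the same functions are also available under the names html_elements() and html_element().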

It would be great if someone could help me extract the title for class="published_info small normal".
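
One possible way to get there, as a sketch only: in the sample HTML the publication link carries the citation in its highlight_docid query parameter (atf%3A%2F%2F143-I-73%3Ade, i.e. atf://143-I-73:de once decoded). Assuming every published entry follows that same pattern, which is inferred from this single example and not verified against the site, the citation can be pulled out of the link with base R. The helper name extract_citation is hypothetical.

```r
# Hypothetical helper: derive a citation such as "143 I 73" from a publication link.
# Assumption: the link contains highlight_docid=atf://<citation-with-dashes>:<lang>.
extract_citation <- function(href) {
  vapply(href, function(h) {
    if (is.na(h)) return(NA_character_)          # no published_info link for this row
    decoded <- utils::URLdecode(h)               # "%3A%2F%2F" becomes "://"
    m <- regmatches(decoded,
                    regexec("highlight_docid=atf://([^:&]+)", decoded))[[1]]
    if (length(m) < 2) return(NA_character_)     # pattern not found
    gsub("-", " ", m[2], fixed = TRUE)           # "143-I-73" becomes "143 I 73"
  }, character(1), USE.NAMES = FALSE)
}

# The two links from the sample HTML (with &amp; already decoded to &, as html_attr() returns them)
pub_link <- "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&page=1&from_date=01.01.2017&to_date=05.01.2017&sort=relevance&insertion_date=&top_subcollection_aza=all&query_words=&highlight_docid=atf%3A%2F%2F143-I-73%3Ade&azaclir=aza"
title_link <- "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&page=1&from_date=01.01.2017&to_date=05.01.2017&sort=relevance&insertion_date=&top_subcollection_aza=all&query_words=&rank=5&azaclir=aza&highlight_docid=aza%3A%2F%2F05-01-2017-2C_826-2015&number_of_ranks=67"

citations <- extract_citation(c(pub_link, NA, title_link))
print(citations)
```

Applied to the data frame above, this would be something like xdf$citation <- extract_citation(xdf$publication_link); rows without a publication link stay NA.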
