
Question

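I am trying to scrape Wikipedia for some astronomy-related definitions for my project. The code works well, but I can't avoid 404s. I tried tryCatch, but I think I'm missing something here.

I'm looking for a way to get past 404s while running a loop. Here is my code: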

library(rvest)
library(httr)
library(XML)
library(tm)


topic<-c("Neutron star", "Black hole", "sagittarius A")

for(i in topic){

  site<- paste("https://en.wikipedia.org/wiki/", i)
  site <- read_html(site)

  stats<- xmlValue(getNodeSet(htmlParse(site),"//p")[[1]]) #only the first paragraph
  #error = function(e){NA}

  stats[["topic"]] <- i

  stats<- gsub('\\[.*?\\]', '', stats)
  #stats<-stats[!duplicated(stats),]
  #out.file <- data.frame(rbind(stats,F[i]))

  output<-rbind(stats,i)

}

Answer 1

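Use sprintf to build the URLs.
Extract all of the body text from the paragraph nodes.
Remove any vectors that come back with length 0.
I added a step to include all of the body text, annotated with a [paragraph - n] prefix for reference. Because, well... friends don't let friends waste data or make multiple HTTP requests.
Build a data frame for each iteration over your topic list, in the following form:
wiki_url: should be obvious
topic: from the topic list
info_summary: the first paragraph (the one you mentioned in your post)
all_info: in case you need more... you know.
Bind all of the data.frames in the list into one.
Note that I am using an older, source version of rvest.

For ease of understanding, I am simply assigning the name html to what would be your read_html.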

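   library(rvest)
   library(jsonlite)

   # alias so the calls below can be written html(url)
   html <- rvest::read_html

   wiki_base <- "https://en.wikipedia.org/wiki/%s"

   # `topic` is the character vector defined in the question
   my_table <- lapply(sprintf(wiki_base, topic), function(i){

        # text of every <p> node on the page
        raw_1 <- html_text(html_nodes(html(i),"p"))

        # drop empty paragraphs
        raw_valid <- raw_1[nchar(raw_1)>0]

        # prefix each paragraph with its index, then collapse into one string
        all_info <- lapply(seq_along(raw_valid), function(n){
            sprintf(' [paragraph - %d] %s ', n, raw_valid[[n]])
        }) %>% paste0(collapse = "")

        data.frame(wiki_url = i,
                   topic = basename(i),
                   info_summary = raw_valid[[1]],
                   all_info = trimws(all_info),
                   stringsAsFactors = FALSE)

    }) %>% rbind.pages   # named rbind_pages in newer jsonlite releases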

   > str(my_table)
   'data.frame':    3 obs. of  4 variables:
    $ wiki_url    : chr  "https://en.wikipedia.org/wiki/Neutron star"     "https://en.wikipedia.org/wiki/Black hole" "https://en.wikipedia.org/wiki/sagittarius A"
    $ topic       : chr  "Neutron star" "Black hole" "sagittarius A"
    $ info_summary: chr  "A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and densest stars kno"| __truncated__ "A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even particles and electrom"| __truncated__ "Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constellation Sagittarius"| __truncated__
    $ all_info    : chr  " [paragraph - 1] A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and "| __truncated__ " [paragraph - 1] A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even parti"| __truncated__ " [paragraph - 1] Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constell"| __truncated__
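
The URLs in the str() output above still contain literal spaces. Wikipedia page titles use underscores in place of spaces, so one option is to normalise the topics before building the URLs. The following is a minimal sketch, not part of the original answer; the names slug and wiki_urls are made up for illustration.

```r
topic <- c("Neutron star", "Black hole", "sagittarius A")
wiki_base <- "https://en.wikipedia.org/wiki/%s"

# Replace spaces with underscores, then percent-encode anything else that
# is not URL-safe. vapply() keeps this working even where URLencode() is
# not vectorised.
slug <- vapply(gsub(" ", "_", topic), utils::URLencode, character(1),
               reserved = TRUE, USE.NAMES = FALSE)

wiki_urls <- sprintf(wiki_base, slug)
print(wiki_urls)
```

The resulting vector can be passed to lapply() in place of sprintf(wiki_base, topic).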

Edit

An error-handling function... it returns a logical, so this becomes our first step.

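library(httr)

# TRUE only if a HEAD request to `url` returns HTTP 200;
# any request error (e.g. no connection) is treated as FALSE
url_works <- function(url){
    tryCatch(
        identical(status_code(HEAD(url)), 200L),
        error = function(e){
            FALSE
        })
}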
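
As an alternative to a separate HEAD request per page, the same tryCatch idea can wrap the fetch itself, since read_html() raises an error on a 404. The sketch below assumes that behaviour; first_paragraph and its fetch argument are hypothetical names introduced here so the error path can be exercised without a network connection.

```r
library(rvest)

# Return the first non-empty paragraph of a page, or NA if the page cannot
# be fetched or parsed. `fetch` defaults to read_html and is only a
# parameter so that it can be swapped out for testing.
first_paragraph <- function(url, fetch = rvest::read_html) {
  tryCatch({
    page  <- fetch(url)
    paras <- html_text(html_nodes(page, "p"))
    paras <- paras[nchar(trimws(paras)) > 0]
    if (length(paras) == 0) NA_character_ else paras[[1]]
  }, error = function(e) NA_character_)
}

# Simulate a 404 by making the fetch step fail: the loop would carry on
# with NA for this page instead of stopping.
first_paragraph("https://en.wikipedia.org/wiki/No_such_page",
                fetch = function(u) stop("HTTP error 404."))
```

Note that the whole body sits inside the first argument of tryCatch; a commented-out error = function(e){NA} on its own line, as in the question, has no effect.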

Based on your use of 'exoplanets', here is all of the applicable data from the wiki page:

 exo_data <- (html_nodes(html('https://en.wikipedia.org/wiki/List_of_exoplanets'),'.wikitable')%>%html_table)[[2]]

str(exo_data)

    'data.frame':   2048 obs. of  16 variables:
 $ Name                          : chr  "Proxima Centauri b" "KOI-1843.03" "KOI-1843.01" "KOI-1843.02" ...
 $ bf                            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Mass (Jupiter mass)           : num  0.004 0.0014 NA NA 0.1419 ...
 $ Radius (Jupiter radii)        : num  NA 0.054 0.114 0.071 1.012 ...
 $ Period (days)                 : num  11.186 0.177 4.195 6.356 19.224 ...
 $ Semi-major axis (AU)          : num  0.05 0.0048 0.039 0.052 0.143 0.229 0.0271 0.053 1.33 2.1 ...
 $ Ecc.                          : num  0.35 1.012 NA NA 0.0626 ...
 $ Inc. (deg)                    : num  NA 72 89.4 88.2 87.1 ...
 $ Temp. (K)                     : num  234 NA NA NA 707 ...
 $ Discovery method              : chr  "radial vel." "transit" "transit" "transit" ...
 $ Disc. Year                    : int  2016 2012 2012 2012 2010 2010 2010 2014 2009 2005 ...
 $ Distance (pc)                 : num  1.29 NA NA NA 650 ...
 $ Host star mass (solar masses) : num  0.123 0.46 0.46 0.46 1.05 1.05 1.05 0.69 1.25 0.22 ...
 $ Host star radius (solar radii): num  0.141 0.45 0.45 0.45 1.23 1.23 1.23 NA NA NA ...
 $ Host star temp. (K)           : num  3024 3584 3584 3584 5722 ...
 $ Remarks                       : chr  "Closest exoplanet to our Solar System. Within host star’s habitable zone; possibly Earth-like." "controversial" "controversial" "controversial" ...

From the table, take a 2% sample of names to test:

tests <- dplyr::sample_frac(exo_data, 0.02) %>% .$Name

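Now let's build a reference table with the name, the URL to check, and a logical for whether the URL is valid, and in the same step split it into a list of two data frames: one containing the URLs that do not exist, and another containing those that do. The ones that check out can be run through the function above without problems. This way, the error handling is done before we actually start trying to parse in a loop. It avoids headaches and gives you a reference of the items that need further looking into.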

library(plyr)   # provides ldply
b <- ldply(sprintf('https://en.wikipedia.org/wiki/%s',tests), function(i){
    data.frame(name = basename(i), url_checked = i, url_valid = url_works(i),
               stringsAsFactors = FALSE)
}) %>% split(.$url_valid)

> str(b)
List of 2
 $ FALSE:'data.frame':  24 obs. of  3 variables:
  ..$ name       : chr [1:24] "Kepler-539c" "HD 142 A c" "WASP-44 b" "Kepler-280 b" ...
  ..$ url_checked: chr [1:24] "https://en.wikipedia.org/wiki/Kepler-539c" "https://en.wikipedia.org/wiki/HD 142 A c" "https://en.wikipedia.org/wiki/WASP-44 b" "https://en.wikipedia.org/wiki/Kepler-280 b" ...
  ..$ url_valid  : logi [1:24] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ TRUE :'data.frame':  17 obs. of  3 variables:
  ..$ name       : chr [1:17] "HD 179079 b" "HD 47186 c" "HD 93083 b" "HD 200964 b" ...
  ..$ url_checked: chr [1:17] "https://en.wikipedia.org/wiki/HD 179079 b" "https://en.wikipedia.org/wiki/HD 47186 c" "https://en.wikipedia.org/wiki/HD 93083 b" "https://en.wikipedia.org/wiki/HD 200964 b" ...
  ..$ url_valid  : logi [1:17] TRUE TRUE TRUE TRUE TRUE TRUE ...

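Obviously, the second item of the list contains the data frame with the valid URLs, so apply the earlier function to the url column of that one. Note that, for the sake of explanation, I sampled the table of all planets... there are 2,400-odd names, so in your case that check will take a minute or two to run. Hope that wraps it up for you.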
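
A small offline sketch of that last step, using three rows copied from the str(b) output above as toy data (ref, valid_urls and invalid_names are illustrative names, not part of the original answer). It selects the split pieces by name rather than by position, which is an assumption-free way to guard against the case where every URL is valid and the list has only one element.

```r
# Toy reference table; in practice this comes from the ldply() call above.
ref <- data.frame(
  name        = c("HD 179079 b", "Kepler-539c", "HD 47186 c"),
  url_checked = sprintf("https://en.wikipedia.org/wiki/%s",
                        c("HD 179079 b", "Kepler-539c", "HD 47186 c")),
  url_valid   = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

b <- split(ref, ref$url_valid)

# Index by name: b[["TRUE"]] holds the valid rows, b[["FALSE"]] the rest.
valid_urls    <- b[["TRUE"]]$url_checked
invalid_names <- b[["FALSE"]]$name

print(valid_urls)      # URLs that are safe to pass to the scraping function
print(invalid_names)   # items that need a closer look
```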

Scraping with a loop and avoiding 404 errors
