Question

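I am using rvest to scrape some company filings from the US Securities and Exchange Commission (SEC). Starting with one specific company, I successfully extracted the URL of each of its 10-K filings and put those URLs into a data frame named xcel. I would now like to scrape each of those URLs further.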

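I think it makes the most sense to use a for loop to go through each URL in the xcel$fullurl column, call read_html on each one, and then extract the table from each page.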

I can't get the actual for loop to work. If you think a for loop is not the right approach, I'd like to hear other suggestions.

library(rvest)
library(stringi)

sec<-read_html("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000072903&type=10-k&dateb=&owner=exclude&count=40")
xcel<- sec %>%
 html_nodes("#documentsbutton") %>%
 html_attr("href")
xcel<-data.frame(xcel)
xcel$xcell<-paste0("https://www.sec.gov",xcel$xcell)
xcel$fullurl<-paste0(xcel$xcell,xcel$xcel)
as.character(xcel$fullurl)      #set of URL's that I want to scrape from

#Problem starts here

for (i in xcel$fullurl){
  pageurl<-xcel$fullurl
  phase2 <- read_html(pageurl[i])

hopefully<-phase2 %>%
   html_table("tbody")

Hopefully this should give me the contents of the table on each of these pages.

Answer 1

You can use map / lapply to iterate over each URL and extract the first table from each one.

library(rvest)
library(dplyr)
library(purrr)

map(xcel$fullurl, ~ .x %>% read_html() %>%  html_table() %>% .[[1]])

#   Seq                   Description                   Document     Type     Size
#1    1                          10-K       xcel1231201510-k.htm     10-K  6375358
#2    2                 EXHIBIT 10.28       xcelex1028q42015.htm EX-10.28    57583
#3    3                 EXHIBIT 10.29       xcelex1029q42015.htm EX-10.29    25233
#4    4                 EXHIBIT 12.01       xcelex1201q42015.htm EX-12.01    50108
#5    5                 EXHIBIT 21.01       xcelex2101q42015.htm EX-21.01    22841
#.....

This returns a list of data frames. If you want to combine all of them into a single data frame, you can use map_dfr instead of map.
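
If you would rather keep a for loop, the version in the question has a few problems: for (i in xcel$fullurl) makes i the URL string itself, so pageurl[i] looks up an unnamed vector by name and returns NA; the closing brace is missing; and html_table() does not take a CSS selector such as "tbody". Below is a minimal sketch of a corrected loop, followed by the map_dfr variant. To keep it runnable without network access, it uses two made-up inline HTML strings (the hypothetical pages vector) in place of xcel$fullurl; read_html() accepts literal HTML as well as URLs, so the same pattern should carry over.

```r
library(rvest)
library(purrr)

# Hypothetical stand-ins for xcel$fullurl: literal HTML instead of real URLs
pages <- c(
  "<html><body><table>
     <tr><th>Seq</th><th>Type</th></tr>
     <tr><td>1</td><td>10-K</td></tr>
   </table></body></html>",
  "<html><body><table>
     <tr><th>Seq</th><th>Type</th></tr>
     <tr><td>1</td><td>10-K</td></tr>
     <tr><td>2</td><td>EX-21.01</td></tr>
   </table></body></html>"
)

# for loop: iterate over positions, not over the values themselves
tables <- vector("list", length(pages))
for (i in seq_along(pages)) {
  page <- read_html(pages[i])
  # html_table() returns every table on the page; keep the first one
  tables[[i]] <- html_table(page)[[1]]
}

# Same extraction with map_dfr, which row-binds the results into one data frame;
# .id adds a column recording which element each row came from
combined <- map_dfr(pages, ~ html_table(read_html(.x))[[1]], .id = "page")
```

With the real data you would replace pages with xcel$fullurl. Note that row-binding can fail if the same column comes back with different types on different pages, in which case keeping the list from map and inspecting it first may be easier.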

Writing a loop that passes a column of URLs to read_html