rvest read_html for a specific table

Date: 2019-03-15 16:47:44

Tags: r

I am trying to scrape a web page in R. In the table of contents here:

https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc

I am interested in the following statements:

Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52

The location of these statements, and their page numbers, can vary from one document to another.

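I am trying to use html_nodes() to find these tables, but I can't seem to get it to work. When I inspect the page, I find the table under <div align="CENTER"> == $0, but the table has no id attribute that I could use as a key.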

library(rvest)

url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"

# Read the filing and parse every <table> on the page into a list of data frames
dat <- url %>%
  read_html() %>%
  html_table(fill = TRUE)
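Since this returns every table on the page, one thing I could try is filtering the tables by their text before converting them. The following is only a minimal sketch: the inline HTML is a made-up stand-in for a filing, the phrase "cash flows" is an assumption about the wording, and it only helps when the phrase appears inside the table itself rather than in a heading above it.

```r
library(rvest)

# Hypothetical stand-in for a filing; a real page would come from read_html(url)
html <- '<html><body>
<table><tr><td>Net sales</td><td>100</td></tr></table>
<table>
<tr><td>Cash flows from operating activities</td><td>42</td></tr>
<tr><td>Cash flows from investing activities</td><td>-7</td></tr>
</table>
</body></html>'

# Select the <table> nodes first instead of converting everything at once
tables <- read_html(html) %>% html_nodes("table")

# Keep only the tables whose text mentions the phrase, ignoring case
is_match <- grepl("cash flows", html_text(tables), ignore.case = TRUE)

# Convert just the matching tables to data frames
matched <- html_table(tables[is_match])
```

Here `matched` is a list with one element per matching table, so `matched[[1]]` would be the first candidate to inspect.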

Any push in the right direction would be great!

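Edit: I am aware of the finreportr and finstr packages, but they rely on the XML documents, and not every .HTML page has an XML document. I would also like to do this with the rvest package.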

Edit:

Something like the following works:

url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"

population <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
  html_table()

x <- population[[1]]

It is very messy, but it does get the cash flow statement. The XPath changes from one web page to another.

For example, this one is different:

url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"

population <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
  html_table()

x <- population[[1]]

Is it possible to "search" for the "cash flow" table and somehow extract its XPath?
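To make the idea concrete, here is a rough sketch of the kind of search I have in mind. It assumes the statement title sits in a text node somewhere before its table, uses a made-up inline HTML snippet instead of a real filing, and relies on xml_path() from xml2 to report where each hit lives. On a real filing the phrase would probably also match the table of contents and cross-references, so I would expect several candidates rather than one.

```r
library(rvest)
library(xml2)

# Hypothetical miniature of a filing: a title paragraph followed by its table
html <- '<html><body>
<div><p><b>Consolidated Statement of Earnings</b></p>
<div><table><tr><td>Net sales</td><td>100</td></tr></table></div></div>
<div><p><b>Consolidated Statement of Cash Flows</b></p>
<div><table>
<tr><td>Net cash from operating activities</td><td>42</td></tr>
<tr><td>Net cash from investing activities</td><td>-7</td></tr>
</table></div></div>
</body></html>'

doc <- read_html(html)

# XPath 1.0 has no lower-case(), so translate() is used for a
# case-insensitive match. The phrase must not contain a single quote.
phrase <- "cash flows"
upper <- paste(LETTERS, collapse = "")
lower <- paste(letters, collapse = "")
xp <- sprintf(
  "//text()[contains(translate(., '%s', '%s'), '%s')]/following::table[1]",
  upper, lower, tolower(phrase)
)

# For each text node containing the phrase, take the first table after it
hits <- html_nodes(doc, xpath = xp)

# Absolute XPath of every candidate table, one string per hit
paths <- xml_path(hits)

# Convert the first candidate to a data frame
cash <- html_table(hits[[1]])
```

The `paths` vector is what I would want to "extract": it could be inspected per filing instead of hard-coding expressions like div[623] or div[30]. This would miss titles that are split across several nodes or that contain line breaks or non-breaking spaces between the words.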

Here are some more links to try.

[1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"   
 [2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"    
 [3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"  
 [4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"   
 [5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"      
 [6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"   
 [7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"          
 [8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"    
 [9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"    
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"

0 answers:

No answers