I am trying to scrape a web page in R. In the table of contents here:
https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc
I am interested in:
Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52
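The location of these statements can vary from filing to filing, depending on the document's pagination.
I am trying to use html_nodes() to find these statements, but it does not seem to work. When I inspect the page, I find the table under <div align="CENTER"> == $0, but I cannot find an id attribute on the table to select it by.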
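library(rvest)  # provides read_html(), html_table() and the %>% pipe

url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"
# Parse every table in the filing into a list of data frames
dat <- url %>%
  read_html() %>%
  html_table(fill = TRUE)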
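Any push in the right direction would be great!
Edit: I know about the finreportr and finstr packages, but they rely on XML documents, and not every .htm filing has an XML document. I would also like to do this with the rvest package.
Edit:
Something like the following works: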
url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
html_table()
x <- population[[1]]
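It is very messy, but it does retrieve the cash flow statement. The XPath changes from one filing to the next.
For example, this one is different: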
url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
html_table()
x <- population[[1]]
Is it possible to "search" for the "cash flow" table and somehow extract its XPath?
Here are some more links to try:
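To make the kind of search I mean concrete, here is a minimal sketch. It runs on a small made-up HTML string rather than a real filing. The helper name find_tables() and the toy markup are my own assumptions, not part of rvest. It keeps every table whose text contains a phrase, ignoring case, then asks xml2 for the XPath of each match. In real filings the statement title may sit in a paragraph outside the table, so matching on a phrase from inside the table might be needed instead.

```r
library(rvest)
library(xml2)

# Toy stand-in for a filing (made up for illustration):
# two tables, only the second mentions cash flows.
html <- '<html><body>
<div><table><tr><td>Net sales</td><td>100</td></tr></table></div>
<div align="CENTER"><table>
<tr><td>Consolidated Statement of Cash Flows</td><td>2010</td></tr>
<tr><td>Net earnings</td><td>42</td></tr>
</table></div>
</body></html>'

doc <- read_html(html)

# find_tables() is a hypothetical helper, not an rvest function.
# It selects <table> nodes whose text contains `phrase`, case-insensitively.
# XPath 1.0 has no lower-case function, so translate() maps A-Z to a-z.
find_tables <- function(doc, phrase) {
  upper <- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  lower <- "abcdefghijklmnopqrstuvwxyz"
  xp <- sprintf(
    "//table[contains(translate(normalize-space(.), '%s', '%s'), '%s')]",
    upper, lower, tolower(phrase)
  )
  html_nodes(doc, xpath = xp)
}

hits  <- find_tables(doc, "cash flows")
paths <- xml_path(hits)        # XPath of each matching table node
x     <- html_table(hits[[1]]) # first match as a data frame
```

If this approach holds up on real filings, the hard-coded div[623] and div[30] indices would no longer be needed, because the path is derived from the content of each page.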
[1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"
[2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"
[3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"
[5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"
[6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"
[8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"
[9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"