R XML: extracting href links from the SEC EDGAR website

Time: 2016-06-19 22:58:10

Tags: r web-scraping html-parsing href extract

I have checked similar questions already, with no luck. It seems readHTMLTable should be able to read EDGAR web pages. I am trying to read this URL:

https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL&type=10-Q&dateb=&owner=exclude&count=100

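...and collect all the href links under the "Documents" buttons into a character vector.

The "Documents" links sit inside a table. The first "Documents" href link, as shown by the Firefox inspector, looks like this: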



<div id="seriesDiv" style="margin-top: 0px;">

    <table class="tableFile2" summary="Results">
        <tbody>
            <tr></tr>
            <tr>
                <td nowrap="nowrap"></td>
                <td nowrap="nowrap">
                    <a id="documentsbutton" href="/Archives/edgar/data/320193/000119312516559625/0001193125-16-559625-index.htm">

                         Documents

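So I want to get the href links into a character vector for later use.

The problem: the XML library is giving me trouble, and for some reason the htmltab library's functions are not recognized in my R instance.

Here is my code: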

library(XML)
EDGARURL <- "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL&type=10-Q&dateb=&owner=exclude&count=100"
EDGARHREFtables <- readHTMLTable(EDGARURL, as.data.frame = TRUE)

This produces the following warning:

Warning message:
XML content does not seem to be XML: 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL&type=10-Q&dateb=&owner=exclude&count=100'

What am I missing? Would the XML library's readHTMLTable work for this? If so, how do you extract the href attribute for each document?

1 Answer:

Answer 0 (score: 1)

For simple jobs like this, the rvest package makes it easy:

library(rvest)

url <- 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL&type=10-Q&dateb=&owner=exclude&count=100'

# pull HTML from page
url %>% read_html() %>%
    # get tags with a certain CSS selector
    html_nodes('#documentsbutton') %>%
    # get the href attribute from each node
    html_attr('href')

# [1] "/Archives/edgar/data/320193/000119312516559625/0001193125-16-559625-index.htm"
# [2] "/Archives/edgar/data/320193/000119312516439878/0001193125-16-439878-index.htm"
# [3] "/Archives/edgar/data/320193/000119312515259935/0001193125-15-259935-index.htm"
# [4] "/Archives/edgar/data/320193/000119312515153166/0001193125-15-153166-index.htm"
# [5] "/Archives/edgar/data/320193/000119312515023697/0001193125-15-023697-index.htm"
# [6] "/Archives/edgar/data/320193/000119312514277160/0001193125-14-277160-index.htm"
# [7] "/Archives/edgar/data/320193/000119312514157311/0001193125-14-157311-index.htm"
# [8] "/Archives/edgar/data/320193/000119312514024487/0001193125-14-024487-index.htm"
# [9] "/Archives/edgar/data/320193/000119312513300670/0001193125-13-300670-index.htm"
# [10] "/Archives/edgar/data/320193/000119312513168288/0001193125-13-168288-index.htm"
# ...
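
On the question of whether the XML package can do this: the warning most likely appears because the XML package's built-in downloader does not handle https:// URLs, so readHTMLTable never receives any HTML. Below is a minimal sketch of an XML-only route. It assumes the page source has already been fetched by an HTTPS-capable client (for example httr::GET or RCurl::getURL) and is available as a string. The `html` value here is a hypothetical, trimmed stand-in for the live page, and prepending https://www.sec.gov is an assumption about how the site-relative hrefs resolve.

```r
library(XML)

# Hypothetical, trimmed sample of the EDGAR results table; on the live page
# this string would be the body returned by an HTTPS-capable client.
html <- '
<div id="seriesDiv">
  <table class="tableFile2" summary="Results">
    <tr>
      <td><a id="documentsbutton" href="/Archives/edgar/data/320193/000119312516559625/0001193125-16-559625-index.htm">Documents</a></td>
    </tr>
    <tr>
      <td><a id="documentsbutton" href="/Archives/edgar/data/320193/000119312516439878/0001193125-16-439878-index.htm">Documents</a></td>
    </tr>
  </table>
</div>'

# asText = TRUE tells htmlParse that the argument is HTML content,
# not a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# For every <a> whose id is "documentsbutton", read its href attribute.
hrefs <- xpathSApply(doc, "//a[@id='documentsbutton']", xmlGetAttr, "href")

# Assumption: the hrefs are relative to the SEC host, so prepend it.
links <- paste0("https://www.sec.gov", hrefs)
print(links)
```

Here readHTMLTable is not needed at all, since it extracts cell text rather than link attributes; an XPath query on the parsed document targets the href values directly.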