Summary at the bottom of the post.
PART 1: I am trying to adapt a function to fit my data, but I get the following error:
Error in mutate_impl(.data, dots) :
Evaluation error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 0: Columns `node`, `text`
* Length 2: Column `nid`
Call `rlang::last_error()` to see a backtrace.
The part of the function I am trying to run:
parse10k <- function(uri) {
  # 10-K HTML files are very flat with a long list of nodes. This pulls all
  # the relevant nodes.
  nodes <- read_html(uri) %>%
    html_nodes('text') %>%
    xml_children()
  nodes <- nodes[xml_name(nodes) != "hr"]
  # Unfortunately there isn't much of a workaround to this loop - we need
  # to track position in the file so it has to be a bit sequential...
  doc.parts <- tibble(nid = seq(length(nodes)),
                      node = nodes,
                      text = xml_text(nodes) ) %>%
    filter(text != "") # way to get columns defined properly
}
Running the function:
data2 <- df %>%
  rename_(ID = ".id") %>%
  rowwise() %>%
  filter(grepl(".htm", doc.href, fixed = TRUE)) %>%
  filter(!grepl(".html", doc.href, fixed = TRUE)) %>%
  mutate(nodes = map(doc.href, parse10k)) %>%
  #select(-accession_number, -href, -mdlink, -doc.href, -reportLink) %>%
  ungroup() %>%
  group_by(filing_date)
The error:
Error in mutate_impl(.data, dots) :
Evaluation error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 0: Columns `node`, `text`
* Length 2: Column `nid`
Call `rlang::last_error()` to see a backtrace.
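PART 2 The problem seems to be caused by a few of the links. I took everything out of the previously defined function and ran each link through it. The suspect links are below (code at the bottom of Part 2).
"Bad" link: https://www.sec.gov/Archives/edgar/data/789019/000119312515272806/d918813d10k.htm
This returns a parts data frame with 110 observations...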
# A tibble: 110 x 2
nid text
<dbl> <chr>
1 0 PART 0
2 21 "PART I "
3 26 "PART I "
4 41 "PART I "
5 66 "PART I "
6 93 "PART I "
7 126 "PART I "
8 147 "PART I "
9 171 "PART I "
10 191 "PART I "
# ... with 100 more rows
This link:
"Good" link: https://www.sec.gov/Archives/edgar/data/1045810/000104581009000013/fy2009form10k.htm
returns the correct number of parts.
# A tibble: 4 x 2
nid text
<dbl> <chr>
1 0 PART 0
2 65 PART I
3 651 PART II
4 693 NA
This is the code I ran the links through:
url <- "https://www.sec.gov/Archives/edgar/data/789019/000119312515272806/d918813d10k.htm"
nodes <- NULL; doc.parts <- NULL; parts <- NULL
nodes <- read_html(url) %>%
  html_nodes('text') %>%
  xml_children()
nodes <- nodes[xml_name(nodes) != "hr"]
doc.parts <- tibble(nid = seq(length(nodes)),
                    node = nodes,
                    text = xml_text(nodes) ) %>%
  filter(text != "")
parts <- doc.parts %>%
  filter(grepl("^part",text, ignore.case=TRUE)) %>%
  select(nid,text)
# mutate(next.nid = c(nid[-1],length(nodes)+1)) %>%
if (parts$nid[1] > 1) {
  parts <- bind_rows(tibble(nid = 0, text= "PART 0"), parts)
}
parts <- bind_rows(parts,
                   tibble(nid = doc.parts$nid[length(doc.parts$nid)] + 1,
                          text = "NA"))
PART 3
I also looked at the doc.parts data frames, and they differ. Under the node column, the "good" link looks like this:
{xml_nodeset (6)}
[1] <title>fy2009form10k.htm</title>\n
[2] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
[3] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
[4] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
[5] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
[6] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
The "bad" link looks like this:
{xml_nodeset (6)}
[1] <title>10-K</title>\n
[2] <h5 align="left"><a href="#toc">Table of Contents</a></h5>
[3] <div style="width:97%; margin-top:1.5%; margin-left:1.5%; margin-ri ...
[4] <p style="page-break-before:always">\n</p>\n
[5] <h5 align="left"><a href="#toc">Table of Contents</a></h5>
[6] <div style="width:97%; margin-top:1.5%; margin-left:1.5%; margin-ri ...
So something goes wrong in this part of the function (from Part 1):
doc.parts <- tibble(nid = seq(length(nodes)),
node = nodes,
text = xml_text(nodes) ) %>%
filter(text != "")
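Specifically, the xml_text(nodes) part.
Question: Is there a way to know in advance which links are likely to be "bad"?
- I tried removing all the .txt and .html links that the function cannot read, but some .htm links still cause problems.
- I would prefer not to drop them, but I will if necessary. Would tryCatch() be useful here?
PART 4 Because the web pages differ, I get lists of different lengths when I run the links through a for loop again.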
links <- c("https://www.sec.gov/Archives/edgar/data/27419/000002741914000014/tgt-20140201x10k.htm",
"https://www.sec.gov/Archives/edgar/data/1090012/000095013409003904/d66379e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/789019/000119312511200680/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/789019/000119312504150689/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/dal1231201610k.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746912001478/a2207295z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746905006608/a2152901z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/27419/000104746910002121/a2196751z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/98246/000095012309005683/y75075e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/818479/000081847914000004/dentsply201310-k.htm",
"https://www.sec.gov/Archives/edgar/data/1045810/000104581009000013/fy2009form10k.htm",
"https://www.sec.gov/Archives/edgar/data/789019/000119312515272806/d918813d10k.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746913001494/a2212713z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/nvda-2015x10k.htm",
"https://www.sec.gov/Archives/edgar/data/1045810/000104581011000015/fy2011form10k.htm",
"https://www.sec.gov/Archives/edgar/data/1090012/000119312514076267/d656849d10k.htm"
)
When I run this loop
nodes <- NULL; doc.parts <- NULL; parts <- NULL
for(link in links){
  nodes[[link]] <- read_html(link) %>%
    html_nodes('text') %>%
    xml_children()
  nodes[[link]] <- nodes[xml_name(nodes[[link]]) != "hr"]
  doc.parts[[link]] <- tibble(nid = seq(length(nodes)),
                              node = nodes)
  #text = xml_text(nodes) ) %>%
  #filter(text != "")
}
I get this error:
Error in UseMethod("xml_text") :
no applicable method for 'xml_text' applied to an object of class "list"
This may also be what is causing the problem in the function.
However, if I comment out the problem lines, I do not get an error:
for(link in links){
  nodes[[link]] <- read_html(link) %>%
    html_nodes('text') %>%
    xml_children()
  nodes[[link]] <- nodes[xml_name(nodes[[link]]) != "hr"]
  doc.parts[[link]] <- tibble(nid = seq(length(nodes[[link]])))
  #node = nodes[[link]])
  #text = xml_text(nodes[[link]]) ) %>%
  #filter(text != "")
}
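doc.parts is a list whose elements have different lengths.
I think the main problem is that the web pages differ and the function does not know how to handle one kind of page, which also affects this part.
# Data: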
df <- structure(list(.id = c("TGT", "DVN", "XRAY", "XRAY", "MSFT",
"MSFT", "DAL", "AON", "AON", "TGT", "TGT", "TIF", "XRAY", "NVDA",
"MSFT", "AON", "MSFT", "NVDA", "NVDA", "DVN"), accession_number = c("0000027419-14-000014",
"0000950134-09-003904", "0000818479-04-000031", "0000818479-99-000003",
"0001193125-11-200680", "0001193125-04-150689", "0000027904-17-000004",
"0001047469-12-001478", "0001047469-05-006608", "0001047469-10-002121",
"0001047469-98-015191", "0000950123-09-005683", "0000818479-14-000004",
"0001045810-09-000013", "0001193125-15-272806", "0001047469-13-001494",
"0000891020-95-000433", "0001045810-15-000036", "0001045810-11-000015",
"0001193125-14-076267"), act = c("34", "34", NA, NA, "34", NA,
"34", "34", "34", "34", NA, "34", "34", "34", "34", "34", NA,
"34", "34", "34"), file_number = c("001-06049", "001-32318",
"000-16211", "000-16211", "000-14278", "000-14278", "001-05424",
"001-07933", "001-07933", "001-06049", "001-06049", "001-09494",
"000-16211", "000-23985", "000-14278", "001-07933", "000-14278",
"000-23985", "000-23985", "001-32318"), filing_date = structure(c(1394751600,
1235689200, 1079305200, 922744800, 1311804000, 1093989600, 1486940400,
1330038000, 1110927600, 1268348400, 892591200, 1238364000, 1392850800,
1236898800, 1438293600, 1361487600, 811983600, 1426114800, 1300230000,
1393542000), class = c("POSIXct", "POSIXt"), tzone = ""), accepted_date = structure(c(1394751600,
1235689200, 1079305200, 922744800, 1311804000, 1093989600, 1486940400,
1330038000, 1110841200, 1268348400, 892591200, 1238364000, 1392850800,
1236898800, 1438293600, 1361487600, 811983600, 1426028400, 1300230000,
1393542000), class = c("POSIXct", "POSIXt"), tzone = ""), href = c("https://www.sec.gov/Archives/edgar/data/27419/000002741914000014/0000027419-14-000014-index.htm",
"https://www.sec.gov/Archives/edgar/data/1090012/000095013409003904/0000950134-09-003904-index.htm",
"https://www.sec.gov/Archives/edgar/data/818479/000081847904000031/0000818479-04-000031-index.htm",
"https://www.sec.gov/Archives/edgar/data/818479/0000818479-99-000003-index.html",
"https://www.sec.gov/Archives/edgar/data/789019/000119312511200680/0001193125-11-200680-index.htm",
"https://www.sec.gov/Archives/edgar/data/789019/000119312504150689/0001193125-04-150689-index.htm",
"https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004-index.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746912001478/0001047469-12-001478-index.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746905006608/0001047469-05-006608-index.htm",
"https://www.sec.gov/Archives/edgar/data/27419/000104746910002121/0001047469-10-002121-index.htm",
"https://www.sec.gov/Archives/edgar/data/27419/0001047469-98-015191-index.html",
"https://www.sec.gov/Archives/edgar/data/98246/000095012309005683/0000950123-09-005683-index.htm",
"https://www.sec.gov/Archives/edgar/data/818479/000081847914000004/0000818479-14-000004-index.htm",
"https://www.sec.gov/Archives/edgar/data/1045810/000104581009000013/0001045810-09-000013-index.htm",
"https://www.sec.gov/Archives/edgar/data/789019/000119312515272806/0001193125-15-272806-index.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746913001494/0001047469-13-001494-index.htm",
"https://www.sec.gov/Archives/edgar/data/789019/0000891020-95-000433-index.html",
"https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/0001045810-15-000036-index.htm",
"https://www.sec.gov/Archives/edgar/data/1045810/000104581011000015/0001045810-11-000015-index.htm",
"https://www.sec.gov/Archives/edgar/data/1090012/000119312514076267/0001193125-14-076267-index.htm"
), type = c("10-K", "10-K", "10-K", "10-K", "10-K", "10-K", "10-K",
"10-K", "10-K", "10-K", "10-K", "10-K", "10-K", "10-K", "10-K",
"10-K", "10-K", "10-K", "10-K", "10-K"), film_number = c("14693644",
"09639574", "04670190", "99578860", "11993262", "041011640",
"17600107", "12638817", "05683013", "10676542", "98594743", "09714434",
"14630484", "09677521", "151019135", "13634337", "95575998",
"15694143", "11692266", "14653539"), form_name = c("Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]", "Annual report [Section 13 and 15(d), not S-K Item 405]",
"Annual report [Section 13 and 15(d), not S-K Item 405]"), description = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
size = c("20 MB", "2 MB", "687 KB", "309 KB", "16 MB", "1 MB",
"14 MB", "22 MB", "2 MB", "6 MB", "201 KB", "1 MB", "35 MB",
"4 MB", "14 MB", "24 MB", "189 KB", "16 MB", "19 MB", "41 MB"
), doc.href = c("https://www.sec.gov/Archives/edgar/data/27419/000002741914000014/tgt-20140201x10k.htm",
"https://www.sec.gov/Archives/edgar/data/1090012/000095013409003904/d66379e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/818479/000081847904000031/f102003.txt",
"https://www.sec.gov/Archives/edgar/data/818479/", "https://www.sec.gov/Archives/edgar/data/789019/000119312511200680/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/789019/000119312504150689/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/dal1231201610k.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746912001478/a2207295z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746905006608/a2152901z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/27419/000104746910002121/a2196751z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/27419/", "https://www.sec.gov/Archives/edgar/data/98246/000095012309005683/y75075e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/818479/000081847914000004/dentsply201310-k.htm",
"https://www.sec.gov/Archives/edgar/data/1045810/000104581009000013/fy2009form10k.htm",
"https://www.sec.gov/Archives/edgar/data/789019/000119312515272806/d918813d10k.htm",
"https://www.sec.gov/Archives/edgar/data/315293/000104746913001494/a2212713z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/789019/", "https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/nvda-2015x10k.htm",
"https://www.sec.gov/Archives/edgar/data/1045810/000104581011000015/fy2011form10k.htm",
"https://www.sec.gov/Archives/edgar/data/1090012/000119312514076267/d656849d10k.htm"
), mdlink = c("[Filing Link](https://www.sec.gov/Archives/edgar/data/27419/000002741914000014/0000027419-14-000014-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/1090012/000095013409003904/0000950134-09-003904-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/818479/000081847904000031/0000818479-04-000031-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/818479/0000818479-99-000003-index.html)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/789019/000119312511200680/0001193125-11-200680-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/789019/000119312504150689/0001193125-04-150689-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/315293/000104746912001478/0001047469-12-001478-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/315293/000104746905006608/0001047469-05-006608-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/27419/000104746910002121/0001047469-10-002121-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/27419/0001047469-98-015191-index.html)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/98246/000095012309005683/0000950123-09-005683-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/818479/000081847914000004/0000818479-14-000004-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/1045810/000104581009000013/0001045810-09-000013-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/789019/000119312515272806/0001193125-15-272806-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/315293/000104746913001494/0001047469-13-001494-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/789019/0000891020-95-000433-index.html)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/0001045810-15-000036-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/1045810/000104581011000015/0001045810-11-000015-index.htm)",
"[Filing Link](https://www.sec.gov/Archives/edgar/data/1090012/000119312514076267/0001193125-14-076267-index.htm)"
), reportLink = c("[10-K Link](https://www.sec.gov/Archives/edgar/data/27419/000002741914000014/tgt-20140201x10k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/1090012/000095013409003904/d66379e10vk.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/818479/000081847904000031/f102003.txt)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/818479/)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/789019/000119312511200680/d10k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/789019/000119312504150689/d10k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/dal1231201610k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/315293/000104746912001478/a2207295z10-k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/315293/000104746905006608/a2152901z10-k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/27419/000104746910002121/a2196751z10-k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/27419/)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/98246/000095012309005683/y75075e10vk.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/818479/000081847914000004/dentsply201310-k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/1045810/000104581009000013/fy2009form10k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/789019/000119312515272806/d918813d10k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/315293/000104746913001494/a2212713z10-k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/789019/)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/nvda-2015x10k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/1045810/000104581011000015/fy2011form10k.htm)",
"[10-K Link](https://www.sec.gov/Archives/edgar/data/1090012/000119312514076267/d656849d10k.htm)"
)), row.names = c(64L, 158L, 143L, 148L, 90L, 97L, 109L,
24L, 31L, 68L, 80L, 49L, 133L, 10L, 86L, 23L, 106L, 4L, 8L, 153L
), class = "data.frame")
EDIT1: Some packages:
library(dplyr)
library(plyr)
library(purrr)
library(edgarWebR)
library(rvest)
library(devtools)
library(tidyr)
library(tidytext)
library(stringr)
library(tibble)
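EDIT2: (Summary)
The problem I am having is that I am trying to read many .htm links. For most links everything runs smoothly and correctly, but if the list contains a few "bad" links, the whole function throws an error. I have analysed the problem, and I think the error comes from one part of the code, specifically the doc.parts part, where the code is trying to read two different kinds of HTML/XML.
With the "bad" URL, the node column of doc.parts looks like this: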
> head(doc.parts$node)
{xml_nodeset (6)}
[1] <title>10-K</title>\n
[2] <h5 align="left"><a href="#toc">Table of Contents</a></h5>
[3] <div style="width:97%; margin-top:1.5%; margin-left:1.5%; margin-ri ...
[4] <p style="page-break-before:always">\n</p>\n
[5] <h5 align="left"><a href="#toc">Table of Contents</a></h5>
[6] <div style="width:97%; margin-top:1.5%; margin-left:1.5%; margin-ri ...
This causes all sorts of problems. However, when I run the "good" URL, the same column looks like this:
> head(doc.parts$node)
{xml_nodeset (6)}
[1] <title>fy2009form10k.htm</title>\n
[2] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
[3] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
[4] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
[5] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
[6] <div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MAR ...
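The rest of the function can handle this. I think writing code to read the "bad" URLs would be too complicated, and there seem to be only a few of them. I think it would be better to skip the bad URLs somehow.
EDIT3:
The following "bad" URL is read in as a large xml_nodes object.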
url <- "https://www.sec.gov/Archives/edgar/data/789019/000119312515272806/d918813d10k.htm"
nodes <- read_html(url) %>%
html_nodes('text') %>%
xml_children()
The "good" URL shows up as a list of 692
url2 <- "https://www.sec.gov/Archives/edgar/data/1045810/000104581009000013/fy2009form10k.htm"
nodes2 <- read_html(url2) %>%
html_nodes('text') %>%
xml_children()
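Answer 0 (score: 1)
I think I am the author of the code you are trying to use, from my 2017 exploration of handling the HTML of SEC filings: https://micah.waldste.in/blog/2017/10/introduction-to-sentiment-analysis-of-10-k-reports-in-r/
tl;dr: Don't use this code. Use the edgarWebR R library, which was built on this approach and is much more robust at browsing the SEC site and at parsing filings and forms.
But for anyone who runs into this or a related issue, let me point out some of the errors you are seeing.
I think everyone gets bitten by this at some point. The code that creates the table contains this block: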
doc.parts <- tibble(nid = seq(length(nodes)),
                    node = nodes,
                    text = xml_text(nodes) ) %>%
  ...
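The error occurs when nodes has length 0. xml_text(nodes) then also has length 0, but something very interesting happens with nid. We would want it to have length 0 as well, but seq(0) returns c(1, 0), which has length 2...
Moral of the story: if you want identifiers for the items in a list, use seq_along(nodes) rather than seq(length(nodes)), so that it does not break in the length-0 case.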
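The zero-length case can be reproduced without any filing at all. The following is a minimal base-R sketch, not the code from the post: empty_text is a made-up name standing in for xml_text(nodes) when the selector matched nothing, and data.frame() is used instead of tibble() only to keep the sketch free of dependencies (it rejects the 2-versus-0 length mismatch as well).

```r
# Stand-in for xml_text(nodes) when no nodes were matched (hypothetical name).
empty_text <- character(0)

# seq(length(x)) becomes seq(0), which counts down from 1 to 0: two ids.
bad_ids <- seq(length(empty_text))
# seq_along(x) yields one id per element, so no ids for an empty input.
good_ids <- seq_along(empty_text)

length(bad_ids)   # 2
length(good_ids)  # 0

# With seq_along() the column lengths agree and the result is an empty table.
ok <- data.frame(nid = good_ids, text = empty_text, stringsAsFactors = FALSE)

# With seq(length()) the column lengths disagree (2 vs 0), so construction fails.
failed <- inherits(
  try(data.frame(nid = bad_ids, text = empty_text), silent = TRUE),
  "try-error"
)
```

With seq_along(), a filing that yields no nodes produces an empty table that the rest of a pipeline can filter out, instead of an error that stops the whole mutate() call.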
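The other pressing question should be: "Why does the parse come back with length 0? Producing a zero-length result is bad."
Unfortunately, although SEC filings are standardized on HTML, they are some of the ugliest, least standardized HTML on the planet. Every company does something a little different, or treats these "standardized forms" as a "branding" opportunity. Parsing them takes more brute force than finesse.
That is why the code you tried works sometimes and fails at other times: it depends on exactly how a particular filing is formatted.
If you want to parse SEC filings, use the R edgarWebR package. We have been working through some of the specific parsing issues there. It is not pretty, but it works.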
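Since some filings may fail whatever the parser does, one option is to wrap the parser so that a failure on one link does not abort the whole run. The following is only a sketch under stated assumptions: safe_parse and fake_parser are hypothetical names, and fake_parser stands in for parse10k so that the example runs without network access.

```r
# Wrap any parser so that an error yields NULL (plus a message) instead of stopping.
safe_parse <- function(uri, parser) {
  tryCatch(
    parser(uri),
    error = function(e) {
      message("Skipping ", uri, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Hypothetical stand-in for parse10k(): fails on any link containing "bad".
fake_parser <- function(uri) {
  if (grepl("bad", uri, fixed = TRUE)) stop("could not parse")
  data.frame(nid = 1L, text = "PART I", stringsAsFactors = FALSE)
}

links <- c(a = "good-1.htm", b = "bad-2.htm", c = "good-3.htm")
results <- lapply(links, safe_parse, parser = fake_parser)

# Keep only the links that parsed; the failed ones are NULL.
parsed <- Filter(Negate(is.null), results)
names(parsed)
```

In the pipeline from the question, the equivalent call would be map(doc.href, safe_parse, parser = parse10k). purrr::possibly() provides the same idea as a ready-made wrapper.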