几周前,这里有人帮助我极大地获得了Notable Names数据库中所有链接的列表。我能够运行此代码并获得以下输出
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
html_nodes(".newslink") %>%
html_attrs() %>%
map(pluck(1, 1))
all_ppl_urls <- map(
all_surname_urls,
function(x) read_html(x) %>%
html_nodes("a") %>%
html_attrs() %>%
map(pluck(1, 1))
) %>%
unlist()
all_ppl_urls <- setdiff(
all_ppl_urls[!duplicated(all_ppl_urls)],
c(all_surname_urls, "http://www.nndb.com/")
)
all_ppl_urls[1] %>%
read_html() %>%
html_nodes("p") %>%
html_text()
# [1] "AKA Lee William Aaker"
# [2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
# [3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
# [4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
# ...
我最初的意图是获取一个数据框,在该数据框中,我将获得该人的姓名,其性别,种族,职业和国籍到一个数据框中。
如果您的数据位于html表中,那么我在这里和其他站点上看到的很多问题都会有所帮助,而著名的名称数据库却不是这样。我知道所有4万个站点都需要涉及一个循环,但是经过一个周末的寻找答案之后,我似乎找不到答案。有人可以协助吗?
编辑后添加 我尝试遵循此处的一些规则,但是this request有点复杂
## I tried to run list <- all_ppl_urls%>% map(read_html) but that was taking a LONG time so I decided to just get the first ten links for the sake of showing my example:
example <- head(all_ppl_urls, 10)
list <- example %>% map(read_html)
test <-list %>% map_df(~{
text_1 <- html_nodes(.x, 'p , b') %>% html_text
,我得到了这个错误: 错误: 另外:警告消息: 关闭未使用的连接3(http://www.nndb.com/people/965/000279128/)
答案 0 :(得分:3)
更新
包括了无法正确解析的配置文件的错误例程。如果有任何错误,您将获得一个NA
行(即使可以正确地解析某些信息,这是由于我们立即读取所有字段并且我们依赖于可以读取所有字段)。
也许您想进一步开发该代码以返回部分信息?您可以通过一次又一次(而不是一次)读取这些字段来完成此操作,如果有错误,则返回该字段而不是整个行的NA。但是,这样做的缺点是,代码不仅需要每个配置文件解析一次文档,还需要解析几次。
这是一个依靠Xpath
选择相关字段的功能:
library(rvest)
library(glue)
library(tibble)
library(dplyr)
library(purrr)
scrape_profile <- function(url) {
fields <- c("Gender:", "Race or Ethnicity:", "Occupation:", "Nationality:")
filter <- glue("contains(text(), '{fields}')") %>%
paste0(collapse = " or ")
xp_string <- glue("//b[{filter}]/following::text()[normalize-space()!=''][1]")
tryCatch({
doc <- read_html(url)
name <- doc %>%
html_node(xpath = "(//b/text())[1]") %>%
as.character()
doc %>%
html_nodes(xpath = xp_string) %>%
as.character() %>%
gsub("^\\s|\\s$", "", .) %>%
as.list() %>%
setNames(c("Gender", "Race", "Occupation", "Nationality")) %>%
as_tibble() %>%
mutate(Name = name) %>%
select(Name, everything())
}, error = function(err) {
message(glue("Profile <{url}> could not be parsed properly."))
tibble(Name = ifelse(exists("name"), name, NA), Gender = NA,
Race = NA, Occupation = NA,
Nationality = NA)
})
}
您现在要做的就是将scrape_profile
应用于您的所有个人资料网址:
map_dfr(all_ppl_urls[1:5], scrape_profile)
# # A tibble: 5 x 5
# Name Gender Race Occupation Nationality
# <chr> <chr> <chr> <chr> <chr>
# 1 Lee Aaker Male White Actor United States
# 2 Aaliyah Female Black Singer United States
# 3 Alvar Aalto Male White Architect Finland
# 4 Willie Aames Male White Actor United States
# 5 Kjetil André Aamodt Male White Skier Norway
说明
<b>
标签),有时还会有一个链接标记(<a>
)。css
或XPath
选择器。但是,由于我们要选择文本节点,因此XPath
似乎是唯一的(?)选项://b[contains(text(), "Gender:")]/following::text()[normalize-space()!=' '][1]
选择
::text()[normalize-space()!=' '][1]
/following
)<b>
标记(//b)
,其中Gender:
([contains(text(), "Gender:")]
)Xpath
来匹配多个元素,从而避免显式循环。为此,我们将几条contains(.)
语句粘贴在一起,并用or
分隔tibble
<b>
)文本答案 1 :(得分:2)
这里您可以通过查看每个html文件来获取数据。这只是一种可以获得良好结果的方法...但是...您必须注意,应该编辑这些gsub函数以获得更好的结果。发生这种情况是因为该网址列表(或说该网页)在数据显示方式上不统一。这是您必须处理的事情。例如,以下是两个屏幕截图,您可以在其中找到Web演示中的那些差异:
无论如何,您可以通过修改以下代码来进行管理:
library(purrr)
library(rvest)
[...] #here is your data
all_ppl_urls[100] %>%
read_html() %>%
html_nodes("p") %>%
html_text()
# [3] "Gender: MaleReligion: Eastern OrthodoxRace or Ethnicity: Middle EasternSexual orientation: StraightOccupation: PoliticianParty Affiliation: Republican"
#-----------------------------------------------------------------------------------------------
# NEW WAY
toString(read_html(all_ppl_urls[100])) #get example of how html looks...
#><b>AKA</b> Edmund Spencer Abraham</p>\n<p><b>Born:</b> <a href=\"/lists/681/000106363/\" class=\"proflink\">12-Jun</a>-<a href=\"/lists/951/000105636/\" class=\"proflink\">1952</a><br><b>Birthplace:</b> <a href=\"/geo/604/000080364/\" class=\"proflink\">East Lansing, MI</a><br></p>\n<p><b>Gender:</b> Male<br><b>
#1. remove NA urls (avoid problems later on)
urls <- all_ppl_urls[!is.na(all_ppl_urls)]
length(all_ppl_urls)
length(urls)
#function that creates a list with your data
GetLife <- function (htmlurl) {
htmltext <- toString(read_html(htmlurl))
name <- gsub('^.*AKA</b>\\s*|\\s*</p>\n.*$', '', htmltext)
gender <- gsub('^.*Gender:</b>\\s*|\\s*<br>.*$', '', htmltext)
race <- gsub('^.*Race or Ethnicity:</b>\\s*|\\s*<br>.*$', '', htmltext)
occupation <- gsub('^.*Occupation:</b>\\s*|\\s*<br>.*$|\\s*</a>.*$|\\s*</p>.*$', '', htmltext)
#as occupation seems to have to many hyperlinks we are making another step
occupation <- gsub("<[^>]+>", "",occupation)
nationality <- gsub('^.*Nationality:</b>\\s*|\\s*<br>.*$', '', htmltext)
res <- c(ifelse(nchar(name)>100, NA, name), #function that cleans weird results >100 chars
ifelse(nchar(gender)>100, NA, gender),
ifelse(nchar(race)>100, NA, race),
ifelse(nchar(occupation)>100, NA, occupation),
ifelse(nchar(nationality)>100, NA, nationality),
htmlurl)
return(res)
}
emptydf <- data.frame(matrix(ncol=6, nrow=0)) #creaty empty data frame
colnames(emptydf) <- c("name","gender","race","occupation","nationality","url") #set names in empty data frame
urls <- urls[2020:2030] #sample some of the urls
for (i in 1:length(urls)){
emptydf[i,] <- GetLife(urls[i])
}
emptydf
以下是这10个网址经过分析的示例:
name gender race occupation nationality url
1 <NA> Male White Business United States http://www.nndb.com/people/214/000128827/
2 Mark Alexander Ballas, Jr. Male White Dancer United States http://www.nndb.com/people/162/000346121/
3 Thomas Cass Ballenger Male White Politician United States http://www.nndb.com/people/354/000032258/
4 Severiano Ballesteros Sota Male Hispanic Golf Spain http://www.nndb.com/people/778/000116430/
5 Richard Achilles Ballinger Male White Government United States http://www.nndb.com/people/511/000168007/
6 Steven Anthony Ballmer Male White Business United States http://www.nndb.com/people/644/000022578/
7 Edward Michael Balls Male White Politician England http://www.nndb.com/people/846/000141423/
8 <NA> Male White Judge United States http://www.nndb.com/people/533/000168029/
9 <NA> Male Asian Engineer England http://www.nndb.com/people/100/000123728/
10 Michael A. Balmuth Male White Business United States http://www.nndb.com/people/635/000175110/
11 Aristotle N. Balogh Male White Business United States http://www.nndb.com/people/311/000172792/