Getting text after a word -- R web scraping

Date: 2019-04-08 11:19:27

Tags: r web-scraping

A few weeks ago, someone here helped me immensely get a list of all the links in the Notable Names Database. I was able to run this code and get the following output:

library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"    
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
      html_nodes(".newslink") %>%
      html_attrs() %>%
      map(pluck(1, 1))

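## From each A-Z surname page, collect the href of every link on the
## page (profile links plus site chrome, which gets filtered out below)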
all_ppl_urls <- map(
      all_surname_urls, 
      function(x) read_html(x) %>%
        html_nodes("a") %>%
        html_attrs() %>%
        map(pluck(1, 1))
    ) %>% 
      unlist()

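## Drop duplicates, the surname index pages themselves, and the home page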
all_ppl_urls <- setdiff(
      all_ppl_urls[!duplicated(all_ppl_urls)], 
      c(all_surname_urls, "http://www.nndb.com/")
    )

all_ppl_urls[1] %>%
      read_html() %>%
      html_nodes("p") %>%
      html_text()

# [1] "AKA Lee William Aaker"
# [2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
# [3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
# [4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
# ...

My original intent was to end up with a data frame containing each person's name together with their gender, race, occupation, and nationality.

A lot of the questions I've seen here and on other sites would help if the data were in an HTML table, but the Notable Names Database pages are not built that way. I know a loop over all 40,000 pages has to be involved somewhere, but after a weekend of looking for the answer I can't seem to find one. Could someone assist?

Edited to add: I tried to follow some of the suggestions here, but this request is a bit more complex.

## I tried to run list <- all_ppl_urls %>% map(read_html) but that was taking a LONG time so I decided to just get the first ten links for the sake of showing my example:

example <- head(all_ppl_urls, 10)

list <- example %>% map(read_html)

test <- list %>% map_df(~{
   text_1 <- html_nodes(.x, 'p , b') %>% html_text()
   # ... (rest of the attempt, truncated in the original post)
})
and I got this error:

Error: ...
In addition: Warning message:
closing unused connection 3 (http://www.nndb.com/people/965/000279128/)

2 Answers:

Answer 0 (score: 3)

Update

I included an error routine for profiles that cannot be parsed properly. If anything fails, you get a row of NAs (even if some of the information could have been parsed correctly; this is because we read all the fields at once and rely on all of them being readable).

Maybe you want to develop the code further to return partial information? You could do that by reading the fields one by one (instead of all at once) and, on an error, returning NA for that field rather than for the whole row. The downside is that the code then has to query the document several times per profile instead of once.
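A minimal sketch of that per-field variant (scrape_field is a hypothetical helper of mine, not code from this answer; it reuses the XPath pattern explained further down):

library(rvest)
library(glue)
library(purrr)

## Query a single bold label; return NA_character_ when the label is
## missing or the query fails, instead of failing the whole profile
scrape_field <- function(doc, field) {
   xp <- glue("//b[contains(text(), '{field}')]",
              "/following::text()[normalize-space()!=''][1]")
   tryCatch(
      doc %>% html_node(xpath = xp) %>% as.character() %>% trimws(),
      error = function(err) NA_character_
   )
}

## e.g. parse once, then read the fields one by one:
## doc <- read_html(all_ppl_urls[1])
## map_chr(c("Gender:", "Occupation:"), ~ scrape_field(doc, .x))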


Here is a function that relies on XPath to select the relevant fields:

library(rvest)
library(glue)
library(tibble)
library(dplyr)
library(purrr)

scrape_profile <- function(url) {
   ## the four bold labels we want to read off the profile page
   fields <- c("Gender:", "Race or Ethnicity:", "Occupation:", "Nationality:")
   ## one XPath that matches any of the labels and grabs the first
   ## non-empty text node following each of them
   filter <- glue("contains(text(), '{fields}')") %>%
                  paste0(collapse = " or ")
   xp_string <- glue("//b[{filter}]/following::text()[normalize-space()!=''][1]")
   tryCatch({
      doc <- read_html(url)
      name <- doc %>%
                html_node(xpath = "(//b/text())[1]") %>% 
                as.character()
      doc %>%
         html_nodes(xpath = xp_string) %>%
         as.character() %>%
         gsub("^\\s|\\s$", "", .) %>%
         as.list() %>%
         setNames(c("Gender", "Race", "Occupation", "Nationality")) %>%
         as_tibble() %>%
         mutate(Name = name) %>%
         select(Name, everything())
   }, error = function(err) {
      message(glue("Profile <{url}> could not be parsed properly."))
      tibble(Name = ifelse(exists("name"), name, NA), Gender = NA,
             Race = NA, Occupation = NA,
             Nationality = NA)
   })
}

All you have to do now is apply scrape_profile to all of your profile URLs:

map_dfr(all_ppl_urls[1:5], scrape_profile)
# # A tibble: 5 x 5
#   Name                Gender Race  Occupation Nationality  
#   <chr>               <chr>  <chr> <chr>      <chr>        
# 1 Lee Aaker           Male   White Actor      United States
# 2 Aaliyah             Female Black Singer     United States
# 3 Alvar Aalto         Male   White Architect  Finland      
# 4 Willie Aames        Male   White Actor      United States
# 5 Kjetil André Aamodt Male   White Skier      Norway 

Explanation

  1. Identify the structure of the site: looking at the source of a profile page, you see that all the relevant information except the name comes right after a bold label (i.e. a <b> tag), sometimes additionally wrapped in a link tag (<a>).
  2. Construct the selector: with that information we can now build a CSS or XPath selector. However, since we want to select text nodes, XPath seems to be the only(?) option: //b[contains(text(), "Gender:")]/following::text()[normalize-space()!=''][1] selects
    • the first non-empty text node: ::text()[normalize-space()!=''][1]
    • that follows (/following)
    • a <b> tag (//b), which
    • contains the text Gender: ([contains(text(), "Gender:")])
  3. Multiple selections: since all the labels are structured the same way, we can construct one XPath that matches several elements and thus avoid an explicit loop. To do that, we paste several contains(.) clauses together, separated by or (a short demo follows this list).
  4. Final formatting: lastly, we strip the whitespace and return everything in a tibble.
  5. Name field: the last step is to extract the name, which is basically the first bold (<b>) text on the page.
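To make step 3 concrete, here is what that pasted-together filter evaluates to (just an illustration of the string construction used in scrape_profile):

library(glue)
fields <- c("Gender:", "Race or Ethnicity:", "Occupation:", "Nationality:")
filter <- paste0(glue("contains(text(), '{fields}')"), collapse = " or ")
glue("//b[{filter}]/following::text()[normalize-space()!=''][1]")
# //b[contains(text(), 'Gender:') or contains(text(), 'Race or Ethnicity:') or
#    contains(text(), 'Occupation:') or contains(text(), 'Nationality:')]
#    /following::text()[normalize-space()!=''][1]
# (output wrapped here for readability; glue returns it as a single line)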

Answer 1 (score: 2)

Here is how you can get the data by inspecting each HTML file. This is just one approach that gets decent results... but... be aware that you should edit those gsub calls to get better results. That's because the pages behind that list of URLs are not uniform in how they present the data, and that's something you'll have to deal with. For example, here are two screenshots where you can see that kind of difference between profiles:

[two screenshots of NNDB profile pages with different sets of bold labels]

In any case, you can manage it by adapting the following code:

library(purrr)
library(rvest)

[...] #here is your data

all_ppl_urls[100] %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text()
# [3] "Gender: MaleReligion: Eastern OrthodoxRace or Ethnicity: Middle EasternSexual orientation: StraightOccupation: PoliticianParty Affiliation: Republican"  

#-----------------------------------------------------------------------------------------------
# NEW WAY
toString(read_html(all_ppl_urls[100])) #get example of how html looks...
#><b>AKA</b> Edmund Spencer Abraham</p>\n<p><b>Born:</b> <a href=\"/lists/681/000106363/\" class=\"proflink\">12-Jun</a>-<a href=\"/lists/951/000105636/\" class=\"proflink\">1952</a><br><b>Birthplace:</b> <a href=\"/geo/604/000080364/\" class=\"proflink\">East Lansing, MI</a><br></p>\n<p><b>Gender:</b> Male<br><b>

#1. remove NA urls (avoid problems later on)
urls <- all_ppl_urls[!is.na(all_ppl_urls)]
length(all_ppl_urls)
length(urls)

#function that creates a character vector with the fields for one profile
GetLife <- function (htmlurl) {
    htmltext <- toString(read_html(htmlurl))
    name <- gsub('^.*AKA</b>\\s*|\\s*</p>\n.*$', '', htmltext)
    gender <- gsub('^.*Gender:</b>\\s*|\\s*<br>.*$', '', htmltext)
    race <- gsub('^.*Race or Ethnicity:</b>\\s*|\\s*<br>.*$', '', htmltext)
    occupation <- gsub('^.*Occupation:</b>\\s*|\\s*<br>.*$|\\s*</a>.*$|\\s*</p>.*$', '', htmltext)
    #as occupation seems to have too many hyperlinks we take another step
    occupation <- gsub("<[^>]+>", "",occupation)
    nationality <- gsub('^.*Nationality:</b>\\s*|\\s*<br>.*$', '', htmltext)
    res <- c(ifelse(nchar(name)>100, NA, name), #blank out weird results longer than 100 chars
             ifelse(nchar(gender)>100, NA, gender),
             ifelse(nchar(race)>100, NA, race),
             ifelse(nchar(occupation)>100, NA, occupation),
             ifelse(nchar(nationality)>100, NA, nationality),
             htmlurl)
    return(res)
}

emptydf <- data.frame(matrix(ncol=6, nrow=0)) #create empty data frame
colnames(emptydf) <- c("name","gender","race","occupation","nationality","url") #set names in empty data frame
urls <- urls[2020:2030] #sample some of the urls
for (i in seq_along(urls)){
    emptydf[i,] <- GetLife(urls[i])
}
emptydf

And here is the result of parsing that sample of URLs:

                         name gender     race occupation   nationality                                       url
1                        <NA>   Male    White   Business United States http://www.nndb.com/people/214/000128827/
2  Mark Alexander Ballas, Jr.   Male    White     Dancer United States http://www.nndb.com/people/162/000346121/
3       Thomas Cass Ballenger   Male    White Politician United States http://www.nndb.com/people/354/000032258/
4  Severiano Ballesteros Sota   Male Hispanic       Golf         Spain http://www.nndb.com/people/778/000116430/
5  Richard Achilles Ballinger   Male    White Government United States http://www.nndb.com/people/511/000168007/
6      Steven Anthony Ballmer   Male    White   Business United States http://www.nndb.com/people/644/000022578/
7        Edward Michael Balls   Male    White Politician       England http://www.nndb.com/people/846/000141423/
8                        <NA>   Male    White      Judge United States http://www.nndb.com/people/533/000168029/
9                        <NA>   Male    Asian   Engineer       England http://www.nndb.com/people/100/000123728/
10         Michael A. Balmuth   Male    White   Business United States http://www.nndb.com/people/635/000175110/
11        Aristotle N. Balogh   Male    White   Business United States http://www.nndb.com/people/311/000172792/
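One way to make those gsub calls sturdier: when a profile lacks one of the bold labels, the patterns above only strip from the first <br> onward, leaving a long prefix that the nchar() > 100 check then turns into NA. A guarded variant (extract_field is my sketch, not part of the original answer) tests for the marker first:

## Return NA right away when the "<b>Marker</b>" label is absent;
## otherwise extract the text with the same gsub pattern as GetLife
## (marker must not contain regex metacharacters)
extract_field <- function(htmltext, marker) {
    if (!grepl(paste0(marker, "</b>"), htmltext, fixed = TRUE)) return(NA_character_)
    gsub(paste0("^.*", marker, "</b>\\s*|\\s*<br>.*$"), "", htmltext)
}

## e.g. extract_field(toString(read_html(urls[1])), "Religion:")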