Question

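I am trying to scrape a large number of debates using rvest. The debates are on separate web pages, and I collect the URLs of those pages from search results. There are over 1,000 pages of search results, covering 20,000 debate pages (i.e. 20,000 URLs).

My current approach successfully gets the data I need from the debate pages, but for anything more than 20 pages of search results (i.e. only 400 of the 20,000 URLs) it takes a very long time to run.

I am currently using a for loop that iterates over my list of URLs and scrapes the 5 HTML nodes holding the content I need (see below). This creates a vector for each scraped node, and the vectors are then combined into a data frame for analysis. I think this approach means I am requesting each web page 5 separate times, once per HTML node.

Is there a way to scrape this more efficiently? I am sure there is a way to scrape all 5 nodes in a single call per URL rather than repeating the call 5 times. Likewise, would it be possible to populate the data frame dynamically inside the for loop instead of storing 5 separate vectors? Alternatively, could I use parallel processing to scrape several URLs at the same time? I am pretty stumped.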

#create empty
speakerid <- c()
parties <- c()
contributions <- c()
titles <- c()
debatedates <- c()

#for loop to scrape relevant content
for(i in debate_urls$url) { 

  debate_urls <- read_html(i)
  speaker <- debate_urls %>% html_nodes(".debate-speech__speaker__name") %>% html_text("")
  speakerid = append(speakerid, speaker)

  debate_urls <- read_html(i)
  party <- debate_urls %>% html_nodes(".debate-speech__speaker__position") %>% html_text("")
  parties = append(parties, party)

  debate_urls <- read_html(i)
  contribution <- debate_urls %>% html_nodes(".debate-speech__speaker+ .debate-speech__content") %>% html_text("p")
  contributions = append(contributions, contribution)

  debate_urls <- read_html(i)
  title <- debate_urls %>%
    html_node(".full-page__unit h1") %>%
    html_text()
  titles = append(titles, rep(title,each=length(contribution)))

  debate_urls <- read_html(i)
  debatedate <- debate_urls %>%
    html_node(".time") %>%
    html_text("href")
  debatedates = append(debatedates, rep(debatedate,each=length(contribution)))
  }

debatedata <- data.frame(Title=titles, Date=debatedates,Speaker=speakerid,Party=parties,Utterance=contributions)

Note: debate_urls holds the list of debate page URLs.

Any help on how to do this more efficiently would be much appreciated!

Answer 1

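Growing the vectors on every iteration is certainly inefficient. You know how many pages there are (length(debate_urls$url)), so you can set the storage up in advance. Because html_nodes() can return several speeches per page, each slot here is a list element rather than a single string:

n <- length(debate_urls$url)
speakerid <- vector("list", n)
parties <- vector("list", n)
contributions <- vector("list", n)
titles <- vector("list", n)
debatedates <- vector("list", n)

Then your for loop does this:

for(idx in seq_along(debate_urls$url)){
    i <- debate_urls$url[idx]

    # use a separate name so the debate_urls data frame is not overwritten
    page <- read_html(i)
    speaker <- page %>% html_nodes(".debate-speech__speaker__name") %>% html_text()
    speakerid[[idx]] <- speaker
    ...
}

# flatten once after the loop, e.g. speakerid <- unlist(speakerid)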
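
To get a rough feel for the cost of growing a vector versus filling one allocated up front, here is a minimal sketch in base R. The helper names grow_vector and prealloc_vector and the placeholder string "speech" are made up for illustration and are not part of the original code; the timings will vary by machine and R version, so none are shown.

```r
# Hypothetical helpers for illustration only; not from the original post.
grow_vector <- function(n) {
  out <- c()
  for (k in seq_len(n)) {
    out <- append(out, "speech")   # builds a new, longer vector each time
  }
  out
}

prealloc_vector <- function(n) {
  out <- character(n)              # allocate all n slots once
  for (k in seq_len(n)) {
    out[k] <- "speech"             # fill in place
  }
  out
}

n <- 20000
system.time(grown <- grow_vector(n))
system.time(filled <- prealloc_vector(n))

# Both approaches give the same result; only the cost differs.
stopifnot(identical(grown, filled))
```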

What I am less sure about is whether pre-allocating makes much difference compared with the time spent fetching the pages.
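
If fetching is the dominant cost, the more direct saving is probably the one the question itself suggests: read each page once and pull all five fields from the same parsed document. The following is a minimal sketch of that idea, not a tested scraper for the real site. The function name scrape_debate and the inline demo HTML are assumptions made up so the example runs offline; the CSS selectors are the ones from the question. It also assumes every speech on a page has exactly one speaker name, one position, and one content block, otherwise data.frame() will complain about differing lengths.

```r
library(rvest)

# Hypothetical helper: takes an already-parsed page, returns one row per speech.
scrape_debate <- function(page) {
  speaker <- page %>% html_nodes(".debate-speech__speaker__name") %>% html_text(trim = TRUE)
  party <- page %>% html_nodes(".debate-speech__speaker__position") %>% html_text(trim = TRUE)
  contribution <- page %>%
    html_nodes(".debate-speech__speaker+ .debate-speech__content") %>%
    html_text(trim = TRUE)
  title <- page %>% html_node(".full-page__unit h1") %>% html_text(trim = TRUE)
  debatedate <- page %>% html_node(".time") %>% html_text(trim = TRUE)

  data.frame(
    Title = rep(title, length(contribution)),
    Date = rep(debatedate, length(contribution)),
    Speaker = speaker,
    Party = party,
    Utterance = contribution,
    stringsAsFactors = FALSE
  )
}

# Made-up page that mimics the class names in the question, so no network is needed.
demo_html <- '<html><body>
<div class="full-page__unit"><h1>Budget debate</h1></div>
<span class="time">1 May 2019</span>
<div class="debate-speech">
  <h2 class="debate-speech__speaker">
    <strong class="debate-speech__speaker__name">A. Smith</strong>
    <small class="debate-speech__speaker__position">Party X</small>
  </h2>
  <div class="debate-speech__content"><p>First speech.</p></div>
</div>
<div class="debate-speech">
  <h2 class="debate-speech__speaker">
    <strong class="debate-speech__speaker__name">B. Jones</strong>
    <small class="debate-speech__speaker__position">Party Y</small>
  </h2>
  <div class="debate-speech__content"><p>Second speech.</p></div>
</div>
</body></html>'

demo <- scrape_debate(read_html(demo_html))   # parsed once, five fields extracted
print(demo)

# With the real URLs (not run here), one read_html() per page:
# pages <- lapply(debate_urls$url, function(u) scrape_debate(read_html(u)))
# debatedata <- do.call(rbind, pages)
```

Building one small data frame per page and binding them once at the end also avoids keeping five parallel vectors in step. Because scrape_debate() is an ordinary function, the lapply() call could in principle be swapped for parallel::mclapply() or parallel::parLapply() to fetch several URLs at once; whether that helps will depend on the site and the connection.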

Efficient scraping with rvest and for loops
