I am scraping questions and answers from the medhelp forum with the following code, which works correctly:
library(dplyr)
library(rvest)
library(purrr)
library(RCurl)
library(stringr)
library(tidyr)
# Estimate the number of pages on the forum by dividing the total count shown in the forum title by 20
page1_html <- getURL("https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=1")
n_pages <- page1_html %>%
  read_html() %>%
  html_node("div.forum_title") %>%
  html_text() %>%
  str_extract_all("\\d+") %>%
  flatten_chr() %>%
  as.numeric() %>%
  `[`(3) %>%
  {. / 20}
# Get all thread titles and thread links
page_urls <- paste0("https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=", seq_len(n_pages))
page_htmls <- map_chr(page_urls[1], getURL)
scrape_thread_titles <- function(html){
  read_html(html) %>%
    html_nodes(".subj_title a") %>%
    html_text()
}
scrape_thread_links <- function(html){
  read_html(html) %>%
    html_nodes(".subj_title a") %>%
    html_attr("href") %>%
    paste0("https://www.medhelp.org", .)
}
thread_titles <- map(page_htmls, scrape_thread_titles) %>%
  discard(~ length(.x) == 0)
correct_n_pages <- length(thread_titles)
thread_titles <- thread_titles %>%
  flatten_chr()
thread_links <- map(page_htmls, scrape_thread_links) %>%
  `[`(seq_len(correct_n_pages)) %>%
  flatten_chr()
master_data <- tibble(thread_titles, thread_links)
# Scrape all thread posts and poster's IDs
thread_htmls <- map_chr(master_data$thread_links, getURL)
html <- thread_htmls[1]
link <- master_data$thread_links[1]
scrape_poster_ids <- function(html){
  read_html(html) %>%
    html_nodes(css = "span span") %>%
    html_text()
}
scrape_poster_dates <- function(html){
  read_html(html) %>%
    html_nodes(css = ".subj_info .mh_timestamp") %>%
    html_text()
}
scrape_posts <- function(html){
  read_html(html) %>%
    html_nodes(".resp_body , #subject_msg") %>%
    html_text() %>%
    str_replace_all("\r|\n", "") %>%
    str_trim()
}
master_data <- master_data %>%
  mutate(
    poster_ids = map(thread_htmls, scrape_poster_ids),
    posts = map(thread_htmls, scrape_posts),
    dates = map(thread_htmls, scrape_poster_dates)
  ) %>%
  unnest()
head(master_data, 15)
titles <- master_data$thread_titles
posters <- master_data$poster_ids
posts <- master_data$posts
dates <- master_data$dates
# Build the output table; the date column is `dates` (defined on the line above)
employ.data <- data.frame(titles, posters, posts, dates)
write.csv(employ.data, "C:/Asperger/page1.csv", na = "")
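Now I am trying to add the dates of the posts, for questions and answers only, not for comments. Users also add comments, but I do not include those in the output file.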
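I have been using SelectorGadget to find the dates of the questions and answers so that I can use them in the scrape_poster_dates function, and I found the following candidate selectors for the date: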
.username .mh_timestamp
.username:nth-child(2) .mh_timestamp
div time
.subj_info .mh_timestamp
.resp_info .mh_timestamp
However, none of them work. I need to leave the comments out, and I keep getting the following error message:
Error: All nested columns must have the same number of elements.
The only selector that does not give me that error message is
.username:nth-child(2) .mh_timestamp
but with it I only get blank strings.
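
The error message suggests that, for at least one thread, the list columns created in mutate() (poster_ids, posts, dates) do not all contain the same number of elements. One way to check this is to count how many nodes each selector matches in a single thread. The sketch below does that on made-up markup: toy_thread, the "comment" class, and the count_matches helper are assumptions for illustration only and are not medhelp's real HTML, so the counts on the real site may differ.

```r
library(rvest)
library(purrr)

# Hypothetical, simplified thread markup (NOT medhelp's real HTML):
# one question, two answers, and one comment that also carries a timestamp.
toy_thread <- '
<div id="subject_msg">Question text</div>
<div class="subj_info"><span class="mh_timestamp">Jan 01, 2018</span></div>
<div class="resp_body">Answer 1</div>
<div class="resp_info"><span class="mh_timestamp">Jan 02, 2018</span></div>
<div class="comment"><span class="mh_timestamp">Jan 03, 2018</span></div>
<div class="resp_body">Answer 2</div>
<div class="resp_info"><span class="mh_timestamp">Jan 04, 2018</span></div>
'

# Hypothetical helper: number of nodes a CSS selector matches in one page
count_matches <- function(html, css) {
  length(html_nodes(read_html(html), css))
}

# The post selector from scrape_posts, plus three candidate date selectors
selectors <- c(
  posts               = ".resp_body , #subject_msg",
  dates_question_only = ".subj_info .mh_timestamp",
  dates_q_and_a       = ".subj_info .mh_timestamp, .resp_info .mh_timestamp",
  dates_everything    = ".mh_timestamp"
)

# Named integer vector: one match count per selector
match_counts <- map_int(selectors, ~ count_matches(toy_thread, .x))
print(match_counts)
```

In this toy markup only the combined question-and-answer selector matches as many nodes as the post selector. Running the same count against thread_htmls[1] with the real selectors should show which date selector, if any, lines up with the posts.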