Scraping dates in a Q&A forum

Time: 2019-04-17 07:26:13

Tags: r web-scraping

I am scraping questions and answers from the medhelp forum with the following code, which works:

library(dplyr)
library(rvest)
library(purrr)
library(RCurl)
library(stringr)
library(tidyr)


# Estimate the number of pages on the forum by dividing the number of threads by 20

page1_html <- getURL("https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=1") 

n_pages <- page1_html %>%
  read_html() %>%
  html_node("div.forum_title") %>%
  html_text() %>%
  str_extract_all("\\d+") %>%
  flatten_chr() %>%
  as.numeric() %>%
  `[`(3) %>%
  {. / 20}

# Get all thread titles and thread links

page_urls <- paste0("https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=", seq_len(n_pages))

page_htmls <- map_chr(page_urls[1], getURL)

scrape_thread_titles <- function(html){
  read_html(html) %>%
    html_nodes(".subj_title a") %>%
    html_text()
}

scrape_thread_links <- function(html){
  read_html(html) %>%
    html_nodes(".subj_title a") %>%
    html_attr("href") %>%
    paste0("https://www.medhelp.org", .)
}

thread_titles <- map(page_htmls, scrape_thread_titles) %>%
  discard(~ length(.x) == 0)

correct_n_pages <- length(thread_titles)

thread_titles <- thread_titles %>%
  flatten_chr()

thread_links <- map(page_htmls, scrape_thread_links) %>%
  `[`(seq_len(correct_n_pages)) %>%
  flatten_chr()

master_data <- tibble(thread_titles, thread_links)

# Scrape all thread posts and poster's IDs

thread_htmls <- map_chr(master_data$thread_links, getURL)

html <- thread_htmls[1]


link <- master_data$thread_links[1]

scrape_poster_ids <- function(html){
  read_html(html) %>%
    html_nodes(css = "span span") %>%
    html_text()
}


scrape_poster_dates <- function(html){
  read_html(html) %>%
    html_nodes(css = ".subj_info .mh_timestamp") %>%
    html_text()
}



scrape_posts <- function(html){
  read_html(html) %>%
    html_nodes(".resp_body , #subject_msg") %>%
    html_text() %>%
    str_replace_all("\r|\n", "") %>%
    str_trim()
}



master_data <- master_data %>%
  mutate(
    poster_ids = map(thread_htmls, scrape_poster_ids),
    posts = map(thread_htmls, scrape_posts),
    dates = map(thread_htmls, scrape_poster_dates)
  ) %>%
  unnest()

head(master_data, 15)

titles<-master_data$thread_titles
posters<-master_data$poster_ids
posts<-master_data$posts
dates<-master_data$dates

# Use the `dates` vector defined above (`fechas` was never defined)
employ.data <- data.frame(titles, posters, posts, dates)
write.csv(employ.data, "C:/Asperger/page1.csv", na = "")

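Now I am trying to add the dates of the posts, but only for the questions and answers, not for the comments. Users also add comments, but I do not include those in the output file.

I have been using SelectorGadget to find the dates of the questions and answers so that I can use them in the scrape_poster_dates function. I found the following candidate selectors for the dates: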

.username .mh_timestamp
.username:nth-child(2) .mh_timestamp
div time
.subj_info .mh_timestamp
.resp_info .mh_timestamp

But none of them work. I need to leave the comments out, and I keep getting the following error message:

Error: All nested columns must have the same number of elements.
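For context, unnest() raises this error when the list columns of a row hold different numbers of elements, for example when the date selector also matches comment timestamps while the post selector does not. A minimal sketch for finding the offending threads, using made-up toy lists in place of the real scraper output (all values below are hypothetical):

```r
# Toy stand-ins for the per-thread scraper results (hypothetical values).
poster_ids <- list(c("user_a", "user_b"), c("user_c"))
posts      <- list(c("question 1", "answer 1"), c("question 2"))
# Thread 1 has one extra date, as if a comment timestamp had been matched too.
dates      <- list(c("date 1", "date 2", "date 3"), c("date 4"))

# Count the elements each selector returned for every thread.
length_check <- data.frame(
  thread  = seq_along(posts),
  n_ids   = lengths(poster_ids),
  n_posts = lengths(posts),
  n_dates = lengths(dates)
)

# Threads where posts and dates disagree are the ones that break unnest().
mismatched <- which(length_check$n_posts != length_check$n_dates)
print(length_check)
print(mismatched)
```

Running the same comparison on the real poster_ids, posts and dates columns before calling unnest() would show which selector returns too many or too few nodes.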

The only selector that does not give me that error message is

.username:nth-child(2) .mh_timestamp

But with it I only get whitespace.

0 answers:

No answers
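One possible way to keep IDs, posts and dates aligned is to select one container node per question or answer first, and then call html_node() (singular) inside each container, since it returns exactly one result per container and NA where nothing matches. The sketch below runs on invented markup: the class names post, user, body and comment and the datetime attribute are assumptions for illustration, not the real medhelp structure. Reading an attribute with html_attr() is shown because whitespace-only text can mean the visible date is filled in by JavaScript, which is also an assumption here.

```r
library(rvest)

# Hypothetical markup: class names and the datetime attribute are assumptions.
html <- '
<div class="post"><span class="user">alice</span>
  <time class="mh_timestamp" datetime="2019-04-01T10:00:00Z"> </time>
  <div class="body">Question text</div>
  <div class="comment">
    <time class="mh_timestamp" datetime="2019-04-02T09:00:00Z"> </time>
  </div>
</div>
<div class="post"><span class="user">bob</span>
  <div class="body">Answer text</div>
</div>'

# One node per question or answer; comments are nested inside, not separate.
containers <- read_html(html) %>% html_nodes("div.post")

# html_node() yields one value per container (first match, or NA if missing),
# so all three columns always have the same length.
one_thread <- data.frame(
  poster_id = containers %>% html_node(".user") %>% html_text(),
  post      = containers %>% html_node(".body") %>% html_text(),
  date      = containers %>% html_node("time.mh_timestamp") %>%
    html_attr("datetime"),
  stringsAsFactors = FALSE
)
print(one_thread)
```

Because each container contributes exactly one row, the comment timestamp in the first post is ignored and the second post, which has no timestamp, gets NA instead of shifting the other values.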