I have a data frame with two variables and 49 observations, as shown below:
Link URL
1 20 political parties urge govt... http://www.mmtimes.com/index.php/national-news/27265-20-political-parties-urge-govt-to-act-on-rakhine-issue.html
2 Tens of thousands protest across.. http://www.mmtimes.com/index.php/national-news/27236-tens-of-thousands-protest-across-rakhine-over-security-issues.html
I want to scrape the byline and the article text from each URL to build a data frame like this:
data.frame(title=html_text(html_nodes(pg, ".contentheading")),
           authordate=html_text(html_nodes(pg, ".create")),
           text=html_text(html_nodes(pg, "p")),
           stringsAsFactors=FALSE)
I can get the data for a single article like this:
read_html(df$URL[1]) %>%
  html_nodes(".create") %>%
  html_text()
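How can I scrape the data from all of the URLs, using the column in my current data frame (df$URL)?
Update: I am now trying to get the data with the following: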
mmtimes_data <- data.frame(title=character(),author=character(),text=character())
for (i in url) {
  mmtimes <- read_html(i)
  title <- mmtimes %>% html_nodes(".contentheading") %>% html_text() %>% as.character()
  author <- mmtimes %>% html_nodes("created") %>% html_text() %>% as.character()
  text <- mmtimes %>% html_nodes("p") %>% html_text() %>% as.character()
  temp <- data.frame(title, authordate, text)
  mmtimes_data <- rbind(mmtimes_data,temp)
  cat("*")
}
However, I get the following error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
expecting a single value
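One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: read_html() was handed something other than a single URL string, for example a whole column, which is what happens when a for loop iterates over a data frame instead of over one of its columns. The sketch below uses a made-up two-row data frame (the example.com URLs and the helper name is_single_url are hypothetical) to show the difference without touching the network.

```r
# Hypothetical stand-in for the real data frame; the URLs are placeholders.
df <- data.frame(
  Link = c("first article", "second article"),
  URL  = c("http://example.com/a.html", "http://example.com/b.html"),
  stringsAsFactors = FALSE
)

# Hypothetical guard: TRUE only for exactly one non-missing string.
is_single_url <- function(x) {
  is.character(x) && length(x) == 1 && !is.na(x)
}

# Looping over the data frame visits one whole column per iteration,
# so each i has length 2 here and is not a single URL.
for (i in df) print(length(i))

# Looping over the URL column visits one string per iteration.
for (i in df$URL) print(length(i))
```

Calling stopifnot(is_single_url(i)) as the first line of the loop body would turn a confusing parser error into an immediate, readable failure.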
Update #2
I solved the problem with the following code:
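# Empty data frame that collects the results of every iteration
mmtimes_data <- data.frame(title=character(),author=character(),text=character())
for (i in links) {
  # Download and parse one article page
  mmtimes <- read_html(i)
  # Pull the headline, the byline and the paragraphs by CSS selector
  title <- mmtimes %>% html_nodes(".contentheading") %>% html_text() %>% as.character()
  author <- mmtimes %>% html_nodes(".create") %>% html_text() %>% as.character()
  text <- mmtimes %>% html_nodes("p") %>% html_text() %>% as.character()
  # Combine the fields and append them to the running result
  temp <- data.frame(title, author, text)
  mmtimes_data <- rbind(mmtimes_data,temp)
  # Progress marker, one star per article
  cat("*")
}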
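As a possible refinement, here is a minimal sketch that wraps the same selectors in functions. When a page has one title and one byline, data.frame(title, author, text) recycles them across every paragraph, so the loop above yields one row per paragraph; the sketch collapses the paragraphs so that each article becomes a single row. The function names parse_article, scrape_article and scrape_all are hypothetical, the sample HTML is made up for an offline check, and whether the selectors still match the live site is an assumption.

```r
library(rvest)

# Extract the fields from one parsed page, using the selectors from the post.
parse_article <- function(page) {
  data.frame(
    title  = paste(html_text(html_nodes(page, ".contentheading")), collapse = " "),
    author = paste(html_text(html_nodes(page, ".create")), collapse = " "),
    # Join all paragraphs so each article occupies exactly one row.
    text   = paste(html_text(html_nodes(page, "p")), collapse = "\n"),
    stringsAsFactors = FALSE
  )
}

# Download and parse one URL; return NULL instead of stopping on failure.
scrape_article <- function(url) {
  tryCatch(parse_article(read_html(url)), error = function(e) NULL)
}

# Scrape every URL, drop the failures, and stack the remaining rows.
scrape_all <- function(urls) {
  rows <- Filter(Negate(is.null), lapply(urls, scrape_article))
  do.call(rbind, rows)
}

# Offline check with a made-up page, so no network access is needed.
sample_page <- read_html(
  '<html><body><h2 class="contentheading">Sample title</h2>
   <span class="create">By A. Writer</span>
   <p>First paragraph.</p><p>Second paragraph.</p></body></html>'
)
sample_row <- parse_article(sample_page)
```

With the real data this would be called as mmtimes_data <- scrape_all(df$URL). Failed downloads are skipped silently here, which may or may not be the desired behaviour.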