Question

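I am trying to scrape articles from www.filmneweurope.com (the home page, across all of its pages). The end result should be a data frame containing, for each article (20 per page), the URL, title, country (category), date, intro text and full text. For the intro text and the full text, a loop has to visit each article and scrape the content from the article page.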

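Alternatively, is it possible to extract the text of every article into separate .txt files, and additionally have a data frame with the URL, title, date and a reference (e.g. a number) to the corresponding .txt file?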
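The second layout (one .txt file per article plus an index data frame) could look roughly like this. It is a minimal base R sketch, assuming the articles have already been scraped into a data frame with `url`, `title`, `date` and `fulltext` columns. The helper name `write_articles`, the column names and the sample rows are hypothetical and not taken from the site.

```r
# Hypothetical helper: writes each article's full text to "<id>.txt" in out_dir
# and returns an index data frame that refers to the files by numeric id.
write_articles <- function(articles, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  ids <- seq_len(nrow(articles))
  files <- file.path(out_dir, sprintf("%d.txt", ids))
  for (i in ids) {
    writeLines(articles$fulltext[i], files[i])
  }
  # Keep only the metadata plus the reference to the text file.
  data.frame(
    id = ids,
    url = articles$url,
    title = articles$title,
    date = articles$date,
    file = files,
    stringsAsFactors = FALSE
  )
}

# Made-up sample rows standing in for scraped articles (assumption).
sample_articles <- data.frame(
  url = c("http://example.com/a", "http://example.com/b"),
  title = c("First title", "Second title"),
  date = c("2015-06-01", "2015-06-02"),
  fulltext = c("First text", "Second text"),
  stringsAsFactors = FALSE
)

index <- write_articles(sample_articles, file.path(tempdir(), "fne_texts"))
```

The index keeps the data frame small, and a text can be read back on demand with `readLines(index$file[i])`.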

So I need a program that loops over every object (article) and over all pages of Film New Europe (home page).

The code I wrote seems to run, but it does not actually capture any data: my data frame is empty. I don't know why...

library(rvest)
library(plyr)

fne_home <- html_session("http://www.filmneweurope.com/")

# Extract title, date, intro text and full text from one parsed article page.
getArticle <- function(fne_article) {
  title <- fne_article %>%
    html_nodes(".itemTitle") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .)
  date <- fne_article %>%
    html_nodes(".itemDateCreated") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .)
  introtext <- fne_article %>%
    html_nodes(".itemIntroText") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .)
  fulltext <- fne_article %>%
    html_nodes(".itemFullText") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .)
  record_article <- data.frame(title, date, introtext, fulltext)
  record_article
}

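# Scrape every article linked from one listing page and stack the rows.
get20articles <- function(articles_URLs) {
  data <- data.frame()
  # seq_along() gives an empty loop when no URLs were found; 1:length() would not
  for (i in seq_along(articles_URLs)) {
    fne_article <- read_html(paste0("http://filmneweurope.com/", articles_URLs[i]))
    record_article <- getArticle(fne_article)
    data <- rbind.fill(data, record_article)
    print(i)
  }
  data
}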

# Walk through the first X listing pages and collect the articles of each page.
get20articlesXpages <- function(fne_home, X) {
  data <- data.frame()
  for (i in seq_len(X)) {
    if (i != 1) { # go to the next page but don't skip the first one
      next_URL <- fne_home %>%
        html_nodes("li") %>%
        html_nodes("a") %>%
        html_attr("href")
      # jump_to() resolves relative links against the current page; a URL
      # without a scheme ("www...") would be treated as a relative path
      fne_home <- jump_to(fne_home, next_URL[83])
    }
    articles_URLs <- fne_home %>%
      html_nodes(".catItemTitle") %>%
      html_attr("href")
    df20articles <- get20articles(articles_URLs)
    data <- rbind.fill(data, df20articles)
    print(paste0("Page", i))
  }
  data
}
articles <- get20articlesXpages(fne_home, 2)
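One way code like this can run without errors and still return an empty data frame comes from how base R handles zero-length vectors. When a selector matches no nodes, `html_text()` returns `character(0)`, and `data.frame()` built from such vectors has zero rows. A loop over `1:length(x)` also still runs when `x` is empty. The sketch below reproduces this without any network access. Whether this is what actually happens on this site is an assumption, since the selectors were not checked against the live pages, and `first_or_na` is a hypothetical helper.

```r
# Simulate what html_text() returns when a CSS selector matches no nodes.
no_match <- character(0)

# All-empty columns give a data frame with zero rows, not an error.
empty_record <- data.frame(title = no_match, date = no_match,
                           stringsAsFactors = FALSE)

# 1:length(x) counts down from 1 to 0 when x is empty; seq_along(x) is empty.
bad_index <- 1:length(no_match)
good_index <- seq_along(no_match)

# Hypothetical helper: keep the first match, or NA when nothing matched,
# so every article contributes exactly one row.
first_or_na <- function(x) if (length(x) == 0) NA_character_ else x[1]

record <- data.frame(title = first_or_na(no_match),
                     date = first_or_na("2015-06-01"),
                     stringsAsFactors = FALSE)
```

Printing `length(articles_URLs)` and `nrow(record_article)` inside the loops is a quick way to see which selector, if any, comes back empty.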

Thank you very much for your help! Thanks in advance.

Web scraping in a loop with rvest

0 answers: