Question

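I successfully completed the rvest tutorial on web scraping and would like to know: 1) How do I remove the "\n" characters before exporting the file? 2) How do I export the data to a CSV file?

PS: here is the link to the tutorial mentioned above: https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/

I'm new to R, so any help is appreciated.

Here is the code I used: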

library(rvest)
library(dplyr)
# read_html() replaces html(), which is deprecated in newer rvest releases
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

### movie rating ###

lego_movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()


### actors names ###

lego_movie %>%
  html_nodes(".primary_photo+ td") %>%
  html_text()

Answer 1

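To remove \n, or any other spaces, tabs, and carriage returns, append str_replace_all("[ \t\r\n]" , "") (from the stringr package) to the end of the pipeline and save the result to a variable, since you want to write it out as a CSV. Note that this pattern removes every whitespace character, including the space between first and last names:

library(stringr)  # provides str_replace_all()

actor_list <- lego_movie %>%
     html_nodes(".primary_photo+ td") %>%
     html_text() %>% str_replace_all("[ \t\r\n]" , "")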

Output:

 [1] "WillArnett"     "ElizabethBanks" "CraigBerry"     "AlisonBrie"     "DavidBurrows"  
 [6] "AnthonyDaniels" "CharlieDay"     "AmandaFarinos"  "KeithFerguson"  "WillFerrell"   
[11] "WillForte"      "DaveFranco"     "MorganFreeman"  "ToddHansen"     "JonahHill"
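
The names in the output above have lost the space between first and last name, because the pattern matches every whitespace character. If only the padding around each name should go, a minimal base R sketch could look like the following. The sample strings in `raw_names` are made up to mimic what html_text() might return; they are not real scraped output.

```r
# Hypothetical sample input imitating padded html_text() output
raw_names <- c("\n Will Arnett\n ", "\n Elizabeth Banks\n ", "\tCraig  Berry\r\n")

# trimws() strips leading and trailing whitespace only
trimmed <- trimws(raw_names)

# Collapse any remaining internal runs of whitespace to a single space
clean_names <- gsub("\\s+", " ", trimmed)

print(clean_names)
```

If stringr is already loaded, str_trim() or str_squish() should serve the same purpose.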

To save it as a CSV, do the following:

df <- data.frame(actor_list)
write.csv(df, 'actor_list.csv')
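
By default write.csv() also writes the row names as an extra first column. As a hedged variation, the sketch below names the column, drops the row names, and reads the file back to check it. The three names, the column name `actor`, and the temporary file path are placeholders chosen for illustration.

```r
# Placeholder data standing in for the scraped actor names
actor_list <- c("Will Arnett", "Elizabeth Banks", "Craig Berry")

# Give the column an explicit name instead of the default "actor_list"
df <- data.frame(actor = actor_list, stringsAsFactors = FALSE)

# Write to a temporary location; row.names = FALSE omits the index column
csv_path <- file.path(tempdir(), "actor_list.csv")
write.csv(df, csv_path, row.names = FALSE)

# Read the file back to confirm what was written
check <- read.csv(csv_path, stringsAsFactors = FALSE)
print(check)
```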

Answer 2

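Pull the names from the [alt] attribute of the associated images instead; that needs no extra string manipulation. I see no need to repeat the existing answer about writing the CSV.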

library(rvest)
library(dplyr)

lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

### actors names ###

lego_movie %>% html_nodes(".cast_list td:first-child [alt]") %>% html_attr(., "alt")
# lego_movie %>% html_nodes(".cast_list td:nth-child(1) [alt]") %>% html_attr(., "alt")
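
To try the selector without hitting the live site, here is a small offline sketch. The HTML fragment is invented to resemble the structure the selector assumes (a `.cast_list` table whose first cell holds an image with an `alt` attribute); the real page markup may differ or change over time.

```r
library(rvest)

# Invented fragment imitating the assumed cast table layout
sample_html <- '<table class="cast_list">
  <tr><td class="primary_photo"><a href="#"><img alt="Will Arnett"></a></td>
      <td>\n Will Arnett\n </td></tr>
  <tr><td class="primary_photo"><a href="#"><img alt="Elizabeth Banks"></a></td>
      <td>\n Elizabeth Banks\n </td></tr>
</table>'

page <- read_html(sample_html)

# The alt attribute carries the clean name, so no whitespace cleanup is needed
alt_names <- page %>%
  html_nodes(".cast_list td:first-child [alt]") %>%
  html_attr("alt")

print(alt_names)
```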

Side note: much of the information is also stored as JSON in a script tag.

library(jsonlite)
library(rvest)

lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
# Parse the JSON text held in the application/ld+json script tag
data <- jsonlite::fromJSON(lego_movie %>% html_node("[type='application/ld+json']") %>% html_text())
# Example: inspect the actor entries
print(data$actor)
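
As a rough illustration of what the parsed result can look like, the sketch below runs fromJSON() on a hand-written JSON string. The string and its fields (`name`, `actor`, `@type`) are assumptions modelled on typical ld+json markup, not a copy of what the page actually serves.

```r
library(jsonlite)

# Hand-written stand-in for the contents of the ld+json script tag
json_text <- '{
  "name": "The Lego Movie",
  "actor": [
    {"@type": "Person", "name": "Will Arnett"},
    {"@type": "Person", "name": "Elizabeth Banks"}
  ]
}'

data <- fromJSON(json_text)

# An array of objects is simplified to a data frame, so the names are one column
actor_names <- data$actor$name
print(actor_names)
```

The resulting character vector can be wrapped in data.frame() and passed to write.csv() like any other vector.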

How to export data after finding nodes with a selector
