Question

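I'm trying to scrape all the tables on the Wikipedia page for CSI: https://en.wikipedia.org/wiki/List_of_CSI:_Crime_Scene_Investigation_episodes So far I've been able to scrape one table (Season 1) with the code below. Since the tables all share the same class, is there a for loop that would let me simply iterate over all of them?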

Here is my R code:

library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_CSI:_Crime_Scene_Investigation_episodes"
episodes <- url %>%
  read_html() %>%
  html_nodes('#mw-content-text > div > table:nth-child(14)') %>%
  html_table()
episodes <- episodes[[1]]

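Update: I just realized that each table's selector has a different nth-child index, so I decided to assign each table selector to a variable, as shown below. Can I now loop over each table and assign the results to a single DF/variable, "episodes"? Adjusted code: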

library(rvest)  # needed for read_html(), html_nodes() and html_table()
library(dplyr)
library(purrr)
url <- "https://en.wikipedia.org/wiki/List_of_CSI:_Crime_Scene_Investigation_episodes"
table1<- '#mw-content-text > div > table:nth-child(14)'
table2<- '#mw-content-text > div > table:nth-child(18)'
table3<- '#mw-content-text > div > table:nth-child(22)'
table4<- '#mw-content-text > div > table:nth-child(26)'
table5<- '#mw-content-text > div > table:nth-child(30)'
table6<- '#mw-content-text > div > table:nth-child(34)'
table7<- '#mw-content-text > div > table:nth-child(38)'
table8<- '#mw-content-text > div > table:nth-child(42)'
table9<- '#mw-content-text > div > table:nth-child(46)'
table10<- '#mw-content-text > div > table:nth-child(50)'
table11<- '#mw-content-text > div > table:nth-child(54)'
table12<- '#mw-content-text > div > table:nth-child(58)'
table13<- '#mw-content-text > div > table:nth-child(62)'
table14<- '#mw-content-text > div > table:nth-child(66)'
table15<- '#mw-content-text > div > table:nth-child(70)'
table16<- '#mw-content-text > div > table:nth-child(74)'
#table17<- '#mw-content-text > div > table:nth-child(79)'
episodes <- url %>%
  read_html() %>%
  html_nodes(table1) %>%
  html_table(fill = T)
episodes <- episodes[[1]]

write.csv(episodes, file = "test.csv")

Answer 1

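If I understand correctly, what you want is to put all the tables into a single data frame, except for the first one, which lists the seasons and has different column names.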

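Assuming you have purrr and dplyr installed (both part of the tidyverse), the following should do what you want: first extract all the tables, then bind them all together (bar the first one).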

library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_CSI:_Crime_Scene_Investigation_episodes"

episodes <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)

purrr::map_dfr(episodes[-1], dplyr::bind_rows)

For clarity, the first command creates a list of data frames, one for each table on the page.
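As a rough offline illustration of what that list looks like, here is a minimal sketch that uses `rvest::minimal_html()` on a made-up page instead of the live Wikipedia article. The table contents are placeholders, not real episode data.

```r
library(rvest)

# Hypothetical stand-in for the Wikipedia page: two tiny tables
# with placeholder contents.
page <- minimal_html('
  <table>
    <tr><th>No.</th><th>Title</th></tr>
    <tr><td>1</td><td>Episode A</td></tr>
  </table>
  <table>
    <tr><th>No.</th><th>Title</th></tr>
    <tr><td>2</td><td>Episode B</td></tr>
  </table>
')

# Same pattern as above: select every <table> and parse each one.
tables <- page %>%
  html_nodes("table") %>%
  html_table()

length(tables)   # one list element per <table> on the page
str(tables[[1]]) # each element is a data frame
```

Indexing with `tables[[1]]` pulls out a single table, while `tables[-1]` keeps everything except the first.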

map_dfr tells it to iterate over the given list and return a single data frame, built by row-binding the results.
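The binding step can be tried without network access as well. The sketch below is an assumption-laden toy: the page is built with `rvest::minimal_html()`, the first table mimics an overview with different column names, and all cell values are placeholders.

```r
library(rvest)

# Hypothetical page: an overview table followed by two per-season tables.
page <- minimal_html('
  <table>
    <tr><th>Season</th><th>Episodes</th></tr>
    <tr><td>1</td><td>1</td></tr>
  </table>
  <table>
    <tr><th>No.</th><th>Title</th></tr>
    <tr><td>1</td><td>Episode A</td></tr>
  </table>
  <table>
    <tr><th>No.</th><th>Title</th></tr>
    <tr><td>2</td><td>Episode B</td></tr>
  </table>
')

tables <- page %>%
  html_nodes("table") %>%
  html_table()

# Drop the overview table, then row-bind the remaining tables.
combined <- purrr::map_dfr(tables[-1], dplyr::bind_rows)

# Because every element is already a data frame, binding the list
# directly should give the same rows and columns.
direct <- dplyr::bind_rows(tables[-1])

combined
```

On the real page, row-binding can fail if the same column is parsed as different types in different tables; in that case, converting the affected columns to character before binding is one possible workaround.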

Web scraping multiple tables using R (rvest)