Question

I'm trying to pull data on celebrity/notable deaths for analysis. Wikipedia uses a very regular structure in its URL paths for notable deaths by date. It looks like:

https://en.wikipedia.org/wiki/Deaths_in_"MONTH"_"YEAR"

For example, this link leads to the notable deaths in March 2014.

https://en.wikipedia.org/wiki/Deaths_in_March_2014
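
Because the pattern is so regular, the full set of URLs can be generated up front. A minimal sketch, assuming a hypothetical helper named death_url (not part of the original post) and using R's built-in month.name constant:

```r
# Hypothetical helper (an assumption for illustration): one URL per month/year.
death_url <- function(month, year) {
  # paste0() concatenates with no separator, so the URL contains no spaces.
  paste0("https://en.wikipedia.org/wiki/Deaths_in_", month, "_", year)
}

# month.name is a built-in constant holding the twelve English month names.
# expand.grid() varies the first column (month) fastest.
grid <- expand.grid(month = month.name, year = 2014:2015,
                    stringsAsFactors = FALSE)
urls <- death_url(grid$month, grid$year)

urls[3]  # the March 2014 page from the example
```

Building the URLs first keeps the string handling separate from the scraping, which makes either part easier to check on its own.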

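I found the CSS selector for the list I need, "#mw-content-text h3 + ul li", and successfully extracted it for a specific link. Now I'm trying to write a loop to go through the months and years I choose. I thought this would be a pretty simple nested loop, but I get errors when testing it on 2015.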

library(rvest)
data = data.frame()
 mlist = c("January","February","March","April","May","June","July","August",
              "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    site = read_html(paste("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
           "_",y,collapse=""))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    data = rbind(data,text,stringsAsFactors=FALSE)
      }
 }

When I comment out this line:

data = rbind(data,text,stringsAsFactors=FALSE)

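no errors are returned, so the problem is clearly related to this bit. I've posted my whole code so others can comment on it as well. The goal here is to cover many years and then look at the distribution across years and months. For that, I only need to keep the age, month, and year of each death.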

Thanks!

EDIT: Sorry, they were technically warnings, not errors. When I try to view "data" it is a giant mess.

When I run this code on one specific URL instead of in a loop, it works fine and returns readable output.

site = read_html("https://en.wikipedia.org/wiki/Deaths_in_January_2015")
fnames = html_nodes(site,"#mw-content-text h3+ ul li")
text = html_text(fnames)

Here are a few rows from that data set:

text[1:5]
[1] "Barbara Atkinson, 88, British actress (Z-Cars).[1]"                                         
[2] "Staryl C. Austin, 94, American air force brigadier general.[2]"                             
[3] "Ulrich Beck, 70, German sociologist, heart attack.[3]"                                      
[4] "Fiona Cumming, 77, British television director (Doctor Who).[4]"                            
[5] "Eric Cunningham, 65, Canadian politician, Ontario MPP for Wentworth North (1975–1984).[5]"
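
Since the stated goal only needs the age, one possible approach is to pull the first ", <digits>," run out of each entry. This is a sketch under the assumption that every entry follows the "Name, age, description" layout shown above; extract_age is a hypothetical helper, and the last sample entry is made up to show the no-age case:

```r
# Hypothetical helper (an assumption): returns NA where no age is found.
extract_age <- function(entries) {
  m <- regexpr(", [0-9]+,", entries)        # first ", <digits>," in each entry
  age <- rep(NA_integer_, length(entries))
  hit <- m != -1                            # entries that actually matched
  # regmatches() returns only the matched pieces, in order; strip non-digits.
  age[hit] <- as.integer(gsub("[^0-9]", "", regmatches(entries, m)))
  age
}

sample_entries <- c(
  "Barbara Atkinson, 88, British actress (Z-Cars).[1]",
  "Staryl C. Austin, 94, American air force brigadier general.[2]",
  "Ulrich Beck, 70, German sociologist, heart attack.[3]",
  "Made-up Person, description without an age.[9]"  # hypothetical entry
)
ages <- extract_age(sample_entries)
```

Entries whose layout differs (for example, an age given as a range) would need extra handling that this sketch does not attempt.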

Answer 1

I can't reproduce the same error as you, but I think I know what you're trying to do.

I suspect it has to do with each month having a different number of deaths.

I'd suggest doing this:

library(rvest)

mlist = c("January","February","March","April","May","June","July","August",
      "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                       "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    # store this month's entries in a variable named after the month (e.g. January)
    assign(mlist[m],text)
  }
}

This creates a character vector of deaths for each month, stored in a variable named after that month.

Another option (which makes the results easier to use in a loop later) is to use a list:

library(rvest)

# one list slot per month; each slot can hold a vector of any length
data = vector("list",12)
mlist = c("January","February","March","April","May","June","July","August",
      "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                       "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    data[[m]] = text
  }
}

Personally, I don't like working with lists in R, but this seems like the best workaround.
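
If the list feels awkward to work with, it can be flattened into a single data frame afterwards. A minimal sketch, assuming the list has been given month names and using a small stand-in list (the February entry is made up) instead of the scraped data:

```r
# Stand-in for the scraped list; real values would come from html_text().
data <- list(
  c("Barbara Atkinson, 88, British actress (Z-Cars).[1]",
    "Ulrich Beck, 70, German sociologist, heart attack.[3]"),
  c("Made-up Person, 90, example entry.[1]")   # hypothetical entry
)
names(data) <- month.name[seq_along(data)]     # "January", "February"

# Repeat each month name once per entry, then flatten the entries.
deaths <- data.frame(
  month = rep(names(data), lengths(data)),
  entry = unlist(data, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Because each month's name is repeated to match the number of entries, months with different numbers of deaths no longer cause a length mismatch.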

Answer 2

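html_text(fnames) returns a character vector. Your problem is that you are trying to append that vector directly onto a data frame. Try converting the variable text to a data frame first, and then appending it: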

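# data and mlist are defined as in the question
for (y in 2015:2015){
  for (m in 1:12){
    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
           "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)

    # wrap this month's entries in a one-column data frame before appending
    temp<-data.frame(text, stringsAsFactors = FALSE)

    data = rbind(data,temp)
  }
}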

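For performance reasons, this is not the best technique. Each time through the loop, the memory for the data frame is reallocated, which slows things down. However, this is a one-time job with a limited number of requests, so it should be manageable in this case.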
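
One way to avoid growing the data frame inside the loop is to collect each month's rows in a preallocated list and call rbind once at the end. A minimal sketch, assuming a hypothetical fetch_month() stand-in replaces the read_html/html_nodes/html_text steps so that it runs offline:

```r
# Hypothetical stand-in for the scraping step (an assumption, not rvest API).
# It returns a different number of fake entries per month.
fetch_month <- function(month, year) {
  n <- nchar(month) %% 3 + 1
  paste0("Person ", seq_len(n), ", ", month, " ", year)
}

years <- 2014:2015
pieces <- vector("list", length(years) * 12)   # one slot per month/year
i <- 0
for (y in years) {
  for (m in month.name) {
    i <- i + 1
    text <- fetch_month(m, y)
    # year and month are recycled to the length of text
    pieces[[i]] <- data.frame(year = y, month = m, text = text,
                              stringsAsFactors = FALSE)
  }
}

# Bind all the pieces in a single call instead of once per iteration.
data <- do.call(rbind, pieces)
```

Keeping year and month as columns also preserves the information needed for the later analysis by month and year.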

Loop to scrape data from Wikipedia in R
