Running rapply on a list of data frames

Time: 2017-01-23 18:39:41

Tags: r list recursion lapply

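To follow up on two rapply questions from a few years ago (here and here), it seems that rapply works only on simple classes (i.e., vectors and matrices) and not on the multifaceted data.frame class.

In most cases, as shown below, the equivalent of rapply is a nested lapply or one of its wrapper variants, vapply/sapply, where the number of nested calls matches the number of list levels. Below are my test scenarios comparing nested lapply and rapply across vector, matrix, and data frame types. Every type gives matching results except data frames.

Question

Is there a use case for rapply() in base R that recursively runs an operation over a list of data frames and returns a list of data frames, as it does for lists of vectors or matrices? If not, is this a bug, or should the ?rapply base R documentation carry a warning? Most tutorials do not show rapply examples with data frames.

One-dimensional type (character vector)

The following shows that rapply matches its nested lapply counterpart on simple character vectors when counting characters, and also how much faster rapply runs: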

library(microbenchmark)

ScriptLists <- list(R = list.files(path="/path/to/Scripts", pattern="\\.R"),
                    Python = list.files(path="/path/to/Scripts", pattern="\\.py"),
                    SQL = list.files(path="/path/to/Scripts", pattern="\\.sql"),
                    PHP = list.files(path="/path/to/Scripts", pattern="\\.php"),
                    XSLT = list.files(path="/path/to/Scripts", pattern="\\.xsl"))

microbenchmark(
  ScriptsLists1 <- lapply(ScriptLists, function(i){
    unname(vapply(i, function(x){ 
      nchar(x)
      }, numeric(1)))
    })
)
# Unit: microseconds
# min      lq     mean   median      uq     max neval
# 384 408.782 524.1363 434.7675 678.016 886.377   100

microbenchmark(
  ScriptsLists2 <- rapply(ScriptLists, function(x){
    nchar(x)
  }, how="list")
)
# Unit: microseconds
# min           lq     mean   median     uq     max neval
# 110.196 112.8425 131.6141 114.5265 123.91 352.722   100

all.equal(ScriptsLists1, ScriptsLists2)
# [1] TRUE
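
Because the benchmark above depends on a local script directory, here is a minimal self-contained sketch of the same comparison. The file names are hypothetical stand-ins for the list.files() output, not data from the question.

```r
# Hypothetical file names standing in for list.files() output.
scripts <- list(
  R      = c("clean.R", "model.R"),
  Python = c("etl.py"),
  SQL    = c("schema.sql", "views.sql", "seed.sql")
)

# Nested version: lapply over languages, vapply over file names.
nested_counts <- lapply(scripts, function(files) {
  unname(vapply(files, nchar, integer(1)))
})

# rapply version: f receives each character vector (a leaf) as a whole.
rapply_counts <- rapply(scripts, nchar, classes = "character", how = "list")

all.equal(nested_counts, rapply_counts)
```

Note that rapply hands f the whole leaf vector, so the inner vapply loop is not needed when f is already vectorized, as nchar is.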

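Two-dimensional types (matrix and data.frame)

The input data frame (pulled from the top yearly reputation ranking of StackOverflow top users) is used to build a list of top-user data frames by language tag (C#, Python, R, etc.).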

df <- structure(list(user = structure(c(12L, 14L, 19L, 35L, 22L, 32L, 
1L, 36L, 7L, 9L, 2L, 18L, 27L, 6L, 30L, 20L, 10L, 24L, 29L, 23L, 
5L, 3L, 4L, 15L, 25L, 17L, 11L, 8L, 33L, 13L, 34L, 16L, 21L, 
26L, 28L, 31L), .Label = c("akrun", "alecxe", "Alexey Mezenin", 
"BalusC", "Barmar", "CommonsWare", "Darin Dimitrov", "dasblinkenlight", 
"Eric Duminil", "Felix Kling", "Frank van Puffelen", "Gordon Linoff", 
"Greg Hewgill", "Günter Zöchbauer", "GurV", "Hans Passant", "JB Nizet", 
"Jean-François Fabre", "jezrael", "Jon Skeet", "Jonathan Leffler", 
"Martijn Pieters", "Martin R", "matt", "Nina Scholz", "paxdiablo", 
"piRSquared", "Pranav C Balan", "Psidom", "Quentin", "Suragch", 
"T.J. Crowder", "Tim Biegeleisen", "unutbu", "VonC", "Wiktor Stribi?ew"
), class = "factor"), link = structure(c(2L, 17L, 21L, 31L, 1L, 
10L, 27L, 28L, 22L, 33L, 35L, 34L, 20L, 3L, 15L, 19L, 18L, 25L, 
29L, 4L, 8L, 5L, 11L, 32L, 6L, 30L, 16L, 24L, 13L, 36L, 14L, 
12L, 9L, 7L, 23L, 26L), .Label = c("http://www.stackoverflow.com//users/100297/martijn-pieters", 
"http://www.stackoverflow.com//users/1144035/gordon-linoff", 
"http://www.stackoverflow.com//users/115145/commonsware", "http://www.stackoverflow.com//users/1187415/martin-r", 
"http://www.stackoverflow.com//users/1227923/alexey-mezenin", 
"http://www.stackoverflow.com//users/1447675/nina-scholz", "http://www.stackoverflow.com//users/14860/paxdiablo", 
"http://www.stackoverflow.com//users/1491895/barmar", "http://www.stackoverflow.com//users/15168/jonathan-leffler", 
"http://www.stackoverflow.com//users/157247/t-j-crowder", "http://www.stackoverflow.com//users/157882/balusc", 
"http://www.stackoverflow.com//users/17034/hans-passant", "http://www.stackoverflow.com//users/1863229/tim-biegeleisen", 
"http://www.stackoverflow.com//users/190597/unutbu", "http://www.stackoverflow.com//users/19068/quentin", 
"http://www.stackoverflow.com//users/209103/frank-van-puffelen", 
"http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer", 
"http://www.stackoverflow.com//users/218196/felix-kling", "http://www.stackoverflow.com//users/22656/jon-skeet", 
"http://www.stackoverflow.com//users/2336654/pirsquared", "http://www.stackoverflow.com//users/2901002/jezrael", 
"http://www.stackoverflow.com//users/29407/darin-dimitrov", "http://www.stackoverflow.com//users/3037257/pranav-c-balan", 
"http://www.stackoverflow.com//users/335858/dasblinkenlight", 
"http://www.stackoverflow.com//users/341994/matt", "http://www.stackoverflow.com//users/3681880/suragch", 
"http://www.stackoverflow.com//users/3732271/akrun", "http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew", 
"http://www.stackoverflow.com//users/4983450/psidom", "http://www.stackoverflow.com//users/571407/jb-nizet", 
"http://www.stackoverflow.com//users/6309/vonc", "http://www.stackoverflow.com//users/6348498/gurv", 
"http://www.stackoverflow.com//users/6419007/eric-duminil", "http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre", 
"http://www.stackoverflow.com//users/771848/alecxe", "http://www.stackoverflow.com//users/893/greg-hewgill"
), class = "factor"), location = structure(c(17L, 15L, 8L, 12L, 
10L, 26L, 1L, 28L, 23L, 1L, 17L, 25L, 6L, 29L, 26L, 19L, 24L, 
1L, 5L, 13L, 4L, 2L, 3L, 1L, 7L, 20L, 21L, 27L, 22L, 11L, 1L, 
16L, 9L, 1L, 18L, 14L), .Label = c("", "??????", "Amsterdam, Netherlands", 
"Arlington, MA", "Atlanta, GA, United States", "Bellevue, WA, United States", 
"Berlin, Deutschland", "Bratislava, Slovakia", "California, USA", 
"Cambridge, United Kingdom", "Christchurch, New Zealand", "France", 
"Germany", "Hohhot, China", "Linz, Austria", "Madison, WI", "New York, United States", 
"Ramanthali, Kannur, Kerala, India", "Reading, United Kingdom", 
"Saint-Etienne, France", "San Francisco, CA", "Singapore", "Sofia, Bulgaria", 
"Sunnyvale, CA", "Toulouse, France", "United Kingdom", "United States", 
"Warsaw, Poland", "Who Wants to Know?"), class = "factor"), year_rep = structure(c(36L, 
35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L, 
22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L, 
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("3,580", "3,604", 
"3,636", "3,649", "3,688", "3,735", "3,796", "3,814", "3,886", 
"3,920", "3,923", "3,950", "4,016", "4,046", "4,142", "4,179", 
"4,195", "4,236", "4,313", "4,324", "4,348", "4,464", "4,475", 
"4,482", "4,526", "4,723", "4,854", "4,936", "4,948", "5,188", 
"5,258", "5,337", "5,577", "5,740", "5,835", "5,985"), class = "factor"), 
    total_rep = structure(c(18L, 2L, 34L, 27L, 22L, 20L, 5L, 
    3L, 31L, 1L, 6L, 9L, 13L, 25L, 21L, 36L, 14L, 4L, 11L, 7L, 
    8L, 10L, 30L, 29L, 24L, 15L, 35L, 17L, 33L, 23L, 12L, 28L, 
    16L, 19L, 26L, 32L), .Label = c("12,557", "154,439", "158,134", 
    "220,515", "229,553", "233,368", "269,380", "289,989", "30,027", 
    "31,602", "36,950", "401,595", "41,183", "411,535", "418,780", 
    "455,157", "475,813", "499,408", "507,043", "508,310", "509,365", 
    "525,176", "529,137", "61,135", "616,135", "64,476", "651,397", 
    "672,118", "7,932", "703,046", "709,683", "71,032", "77,211", 
    "83,237", "86,520", "921,690"), class = "factor"), tag1 = structure(c(15L, 
    2L, 10L, 6L, 11L, 8L, 12L, 13L, 4L, 14L, 11L, 11L, 10L, 1L, 
    8L, 4L, 8L, 16L, 11L, 16L, 8L, 9L, 7L, 15L, 8L, 7L, 5L, 4L, 
    15L, 6L, 11L, 4L, 3L, 3L, 8L, 16L), .Label = c("android", 
    "angular2", "c", "c#", "firebase", "git", "java", "javascript", 
    "laravel", "pandas", "python", "r", "regex", "ruby", "sql", 
    "swift"), class = "factor"), tag2 = structure(c(23L, 24L, 
    19L, 8L, 20L, 14L, 6L, 13L, 3L, 21L, 22L, 20L, 19L, 12L, 
    10L, 12L, 14L, 11L, 17L, 11L, 18L, 18L, 15L, 16L, 2L, 9L, 
    7L, 12L, 16L, 19L, 17L, 1L, 4L, 5L, 14L, 11L), .Label = c(".net", 
    "arrays", "asp.net-mvc", "bash", "c++", "dplyr", "firebase-database", 
    "github", "hibernate", "html", "ios", "java", "javascript", 
    "jquery", "jsf", "mysql", "pandas", "php", "python", "python-3.x", 
    "ruby-on-rails", "selenium", "sql-server", "typescript"), class = "factor"), 
    tag3 = structure(c(20L, 17L, 11L, 12L, 24L, 15L, 11L, 8L, 
    5L, 4L, 23L, 24L, 11L, 3L, 10L, 1L, 6L, 31L, 25L, 28L, 18L, 
    19L, 26L, 27L, 22L, 16L, 2L, 9L, 15L, 13L, 21L, 30L, 29L, 
    7L, 14L, 2L), .Label = c(".net", "android", "android-intent", 
    "arrays", "asp.net-mvc-3", "asynchronous", "bash", "c#", 
    "c++", "css", "dataframe", "docker", "git-pull", "html", 
    "java", "java-8", "javascript", "jquery", "laravel-5.3", 
    "mysql", "numpy", "object", "protractor", "python-2.7", "r", 
    "servlets", "sql-server", "swift3", "unix", "winforms", "xcode"
    ), class = "factor")), .Names = c("user", "link", "location", 
"year_rep", "total_rep", "tag1", "tag2", "tag3"), class = "data.frame", row.names = c(NA, 
-36L))

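R code

The approach below averages the year_rep and total_rep columns for either type, matrix or data frame (the original refers to them as columns 5 and 6, which apparently assumes the extra leading X column visible in the output further down). Be sure to change the return statement in the setup block and swap in the commented section for the type you want. Note that for matrices rapply() returns the same result as the nested lapply, but for data frames it does not.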

# NESTED LIST SETUP ------------------------------------
LangLists <- list(`c#`=list(), python=list(), sql=list(), php=list(), r=list(),
                  java=list(), javascript=list(), ruby=list(), `c++`=list())

LangLists <- setNames(mapply(function(i, j){

  df <- subset(df, tag1 == j | tag2 == j | tag3 == j)
  df$year_rep <- as.numeric(as.character(gsub(",", "", df$year_rep)))
  df$total_rep <- as.numeric(as.character(gsub(",", "", df$total_rep)))

  return(list(as.matrix(df)))   # MATRIX TYPE
  # return(list(df))            # DF TYPE

}, LangLists, names(LangLists), SIMPLIFY=FALSE), names(LangLists))
# -----------------------------------------------------

# MATRIX RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){         
    cbind(mean(as.numeric(df[, "year_rep"])),
          mean(as.numeric(df[, "total_rep"])))
  })
})

LangLists2 <- rapply(LangLists, function(i){      
  cbind(mean(as.numeric(i[, "year_rep"])),
        mean(as.numeric(i[, "total_rep"])))
}, classes="matrix", how="list")

all.equal(LangLists1, LangLists2)
# [1] TRUE


# DATA FRAME RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){         
    data.frame(year_rep=mean(df$year_rep),
               total_rep=mean(df$total_rep))        
  })
})

LangLists2 <- rapply(LangLists, function(i){      
    data.frame(year_rep=mean(i$year_rep),
               total_rep=mean(i$total_rep))      
}, classes="data.frame", how="list")

all.equal(LangLists1, LangLists2)

# [1] "Component “c#”: Component 1: Names: 2 string mismatches"                                               
# [2] "Component “c#”: Component 1: Attributes: < names for target but not for current >"                     
# [3] "Component “c#”: Component 1: Attributes: < Length mismatch: comparison on first 0 components >"        
# [4] "Component “c#”: Component 1: Length mismatch: comparison on first 2 components"                        
# [5] "Component “c#”: Component 1: Component 1: Modes: numeric, NULL"  
...

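In fact, while the nested lapply still returns a list of complete two-column data frames holding the rep means, rapply on data frames turns each underlying data frame into a list of NULLs. Again, why can rapply not return a list of data frames here, as it does for vectors and matrices?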

# $`c#`
# $`c#`[[1]]
# $`c#`[[1]]$X
# NULL

# $`c#`[[1]]$user
# NULL

# $`c#`[[1]]$link
# NULL

# $`c#`[[1]]$location
# NULL

# $`c#`[[1]]$year_rep
# NULL

# $`c#`[[1]]$total_rep
# NULL

# $`c#`[[1]]$tag1
# NULL

# $`c#`[[1]]$tag2
# NULL

# $`c#`[[1]]$tag3
# NULL

# $python
# $python[[1]]
# $python[[1]]$X
# NULL

# $python[[1]]$user
# NULL

# $python[[1]]$link
# NULL

# $python[[1]]$location
# NULL

# $python[[1]]$year_rep
# NULL

# $python[[1]]$total_rep
# NULL

# $python[[1]]$tag1
# NULL

# $python[[1]]$tag2
# NULL

# $python[[1]]$tag3
# NULL

1 answer:

Answer 0: (score: 2)

rapply does not appear to be designed to handle lists of data.frames.

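From the Details section of ?rapply:

If how = "list" or how = "unlist", the list is copied, all non-list elements which have a class included in classes are replaced by the result of applying f to the element and all others are replaced by deflt.

Since data.frames are lists, they do not fall into the first category. rapply descends into each data frame as it would into any list, and the columns it finds there do not have class data.frame, so they land in the "all others" catch-all and are replaced by deflt, whose default is NULL. This explains the result of the last line of code in the question.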
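
A minimal sketch of this behaviour on a toy list (the data and the names toy, res, and leaf_classes are illustrative, not taken from the question):

```r
# Toy nested list holding one small data frame.
toy <- list(a = list(data.frame(x = 1:2, y = c(2.5, 3.5))))

# classes = "data.frame" with how = "list": rapply descends into the data
# frame, finds no column of class "data.frame", and fills each column slot
# with deflt, which defaults to NULL.
res <- rapply(toy, function(d) d, classes = "data.frame", how = "list")
str(res)

# The leaves rapply actually visits are the columns, as their classes show.
leaf_classes <- rapply(toy, function(col) class(col)[1], how = "list")
str(leaf_classes)
```

The second call suggests the function is never handed the data frame itself, only its columns, which is why a function written against whole data frames never runs.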

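The last alternative for the how argument is "replace", and data.frames appear to be simply ignored in this mode:

If how = "replace", each element of the list which is not itself a list and has a class included in classes is replaced by the result of applying f to the element.

Nothing is said about elements that are themselves lists, and running the code above with how = "replace" appears to return a nested list in which the data.frames are now plain lists. So it seems that rapply walks through them and strips the class attribute.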
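
If a list of data frames is needed as the result, one possible workaround is a small recursive helper that treats data frames as leaves instead of descending into them. The sketch below is an assumption-laden illustration: rapply_df is a hypothetical name, not a base R function, and the toy data stands in for LangLists.

```r
# Hypothetical helper: recurse through plain lists, stop at data frames.
rapply_df <- function(x, f, ...) {
  if (is.data.frame(x)) {
    f(x, ...)                          # data frame: apply f to it as a whole
  } else if (is.list(x)) {
    lapply(x, rapply_df, f = f, ...)   # plain list: keep descending
  } else {
    x                                  # anything else is returned unchanged
  }
}

# Toy stand-in for the nested list of data frames.
toy <- list(
  a = list(data.frame(year_rep = c(1, 3), total_rep = c(10, 30))),
  b = list(data.frame(year_rep = c(2, 4), total_rep = c(20, 40)))
)

rep_means <- function(d) {
  data.frame(year_rep = mean(d$year_rep), total_rep = mean(d$total_rep))
}

means  <- rapply_df(toy, rep_means)
nested <- lapply(toy, function(i) lapply(i, rep_means))

all.equal(means, nested)
```

The is.data.frame() check has to come before is.list(), because a data frame also satisfies is.list().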