Closures as a solution to the data-merging idiom

Time: 2011-10-17 17:23:56

Tags: r functional-programming closures

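I'm trying to wrap my head around closures, and I think I've found a case where they might be helpful.

I have the following pieces to work with:

  • A set of regular expressions for cleaning state names, housed in a function
  • A data.frame with state names (in the standardized form produced by the function above) and state ID codes, used to link the two (the "merge map")

The idea is, given some data.frame with sloppy state names (is the capital listed as "Washington, D.C.", "Washington DC", "District of Columbia", etc.?), to have a single function return the same data.frame with the state name column removed and only the state ID code remaining. Subsequent merges can then happen consistently.

I can do this in any number of ways, but one way that seems particularly elegant is to house the merge map, the regular expressions, and the code that processes everything inside a closure (following the idea that a closure is a function with data).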

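Question 1: Is this a reasonable idea?

Question 2: If so, how do I do it in R?

Here's a deliberately simple clean-state-names function that works on the example data: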

cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia",x)] <- "DC"
  x
}

Here's some example data that the final function will run on:

dat <- structure(list(state = c("Alabama", "Alaska", "Arizona", "Arkansas", 
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia", 
"Florida"), pop08 = structure(c(29L, 44L, 40L, 18L, 25L, 30L, 
22L, 48L, 36L, 13L), .Label = c("1,050,788", "1,288,198", "1,315,809", 
"1,316,456", "1,523,816", "1,783,432", "1,814,468", "1,984,356", 
"10,003,422", "11,485,910", "12,448,279", "12,901,563", "18,328,340", 
"19,490,297", "2,600,167", "2,736,424", "2,802,134", "2,855,390", 
"2,938,618", "24,326,974", "3,002,555", "3,501,252", "3,642,361", 
"3,790,060", "36,756,666", "4,269,245", "4,410,796", "4,479,800", 
"4,661,900", "4,939,456", "5,220,393", "5,627,967", "5,633,597", 
"5,911,605", "532,668", "591,833", "6,214,888", "6,376,792", 
"6,497,967", "6,500,180", "6,549,224", "621,270", "641,481", 
"686,293", "7,769,089", "8,682,661", "804,194", "873,092", "9,222,414", 
"9,685,744", "967,440"), class = "factor")), .Names = c("state", 
"pop08"), row.names = c(NA, 10L), class = "data.frame")

An example merge map (the actual one links FIPS codes to states, so it can't be trivially generated):

merge_map <- data.frame(state=dat$state, id=seq(10) )

Edit: Building on crippledlambda's answer below, here's an attempt at the function:

prepForMerge <- local({
  merge_map <- structure(list(state = c("alabama", "alaska", "arizona", "arkansas",  "california", "colorado", "connecticut", "delaware", "DC", "florida" ), id = 1:10), .Names = c("state", "id"), row.names = c(NA, -10L ), class = "data.frame")
  list(
    replace_merge_map=function(new_merge_map) {
      merge_map <<- new_merge_map
    },
    show_merge_map=function() {
      merge_map
    },
    return_prepped_data.frame=function(dat) {
      dat$state <- cleanStateNames(dat$state)
      dat <- merge(dat,merge_map)
      dat <- subset(dat,select=c(-state))
      dat
    }
  )
})

> prepForMerge$return_prepped_data.frame(dat)
        pop08 id
1   4,661,900  1
2     686,293  2
3   6,500,180  3
4   2,855,390  4
5  36,756,666  5
6   4,939,456  6
7   3,501,252  7
8     591,833  9
9     873,092  8
10 18,328,340 10

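Two problems remain before I'd consider this solved:

  1. Calling prepForMerge$return_prepped_data.frame(dat) every time is painful. Is there any way to have a default function, so that I can just call prepForMerge(dat)? I'm guessing not, given how it's implemented, but perhaps there is at least a convention for a default function...

  2. How do I avoid mixing data and code in the merge_map definition? Ideally I'd clean merge_map elsewhere, then grab it inside the closure and store it.

1 Answer:

Answer 0: (score: 4)

I may be missing the point of your question, but this is one way you could use a closure: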

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   function(patt,newtext) {
+     statenames <- tolower(statenames)
+     statenames[grepl(patt,statenames)] <- newtext
+     statenames
+   }
+ })
> 
> replaceStateNames("columbia","DC")
 [1] "alabama"     "alaska"      "arizona"     "arkansas"    "california" 
 [6] "colorado"    "connecticut" "delaware"    "DC"          "florida"    
> replaceStateNames("alaska","palincountry")
 [1] "alabama"              "palincountry"         "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "florida"             
> replaceStateNames("florida","jebbushland")
 [1] "alabama"              "alaska"               "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "jebbushland"    
> 

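But to generalize, you can replace statenames with a data frame definition and return a function (or a list of functions) that uses this data frame, without having to pass it as an argument in the function call. Example (but note that I've used the ignore.case=TRUE argument in grepl):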

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   list(justreturn=function(patt,newtext) {
+     statenames[grepl(patt,statenames,ignore.case=TRUE)] <- newtext
+     statenames
+   },reassign=function(patt,newtext) {
+     statenames <<- replace(statenames,grepl(patt,statenames,ignore.case=TRUE),newtext)
+     statenames
+   })
+ })

Just like the first example:

> replaceStateNames$justreturn("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

Just return the lexically scoped value of statenames, to check that the original value is unchanged:

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"              "Alaska"               "Arizona"             
 [4] "Arkansas"             "California"           "Colorado"            
 [7] "Connecticut"          "Delaware"             "District of Columbia"
[10] "Florida"             

Do the same thing, but make the change "permanent":

> replaceStateNames$reassign("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

Note that the value of statenames attached to these functions has changed:

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

In any case, you can replace statenames with a data frame, and these simple functions with a "merge map" or whatever other mapping you desire.
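As a rough sketch of that generalization, a function factory can take the merge map as an argument and return a single function. The name makePrepForMerge and the three-row toy map are made up for illustration (the ids are not FIPS codes); cleanStateNames is copied from the question so the block runs on its own.

```r
# cleanStateNames() is copied from the question so this block is self-contained.
cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia", x)] <- "DC"
  x
}

# Hypothetical factory: the returned function closes over merge_map.
makePrepForMerge <- function(merge_map) {
  force(merge_map)  # evaluate the argument now rather than lazily
  function(dat) {
    dat$state <- cleanStateNames(dat$state)
    dat <- merge(dat, merge_map, by = "state")
    # drop the state column, keep everything else
    dat[, setdiff(names(dat), "state"), drop = FALSE]
  }
}

# Toy merge map built outside the closure (illustrative ids only).
toy_map <- data.frame(state = c("alabama", "alaska", "DC"), id = 1:3,
                      stringsAsFactors = FALSE)
prepForMerge <- makePrepForMerge(toy_map)

toy_dat <- data.frame(state = c("Alaska", "District of Columbia"),
                      pop = c(10, 20), stringsAsFactors = FALSE)
res <- prepForMerge(toy_dat)
```

Because the factory returns a plain function rather than a list, the call is simply prepForMerge(dat), and the merge map can be prepared elsewhere and handed in as data.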

Edit

Speaking of "merge", is this what you are looking for? An implementation of the first ?merge example using a closure:

> authors <- data.frame(surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+                       nationality = c("US", "Australia", "US", "UK", "Australia"),
+                       deceased = c("yes", rep("no", 4)))
> books <- data.frame(name = I(c("Tukey", "Venables", "Tierney",
+                       "Ripley", "Ripley", "McNeil", "R Core")),
+                     title = c("Exploratory Data Analysis",
+                       "Modern Applied Statistics ...",
+                       "LISP-STAT",
+                       "Spatial Statistics", "Stochastic Simulation",
+                       "Interactive Data Analysis",
+                       "An Introduction to R"),
+                     other.author = c(NA, "Ripley", NA, NA, NA, NA,
+                       "Venables & Smith"))
> 
> mergewithauthors <- with(list(authors=authors),function(books) 
+   merge(authors, books, by.x = "surname", by.y = "name"))
> 
> mergewithauthors(books)
   surname nationality deceased                         title other.author
1   McNeil   Australia       no     Interactive Data Analysis         <NA>
2   Ripley          UK       no            Spatial Statistics         <NA>
3   Ripley          UK       no         Stochastic Simulation         <NA>
4  Tierney          US       no                     LISP-STAT         <NA>
5    Tukey          US      yes     Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics ...       Ripley
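The same capture can also be written as an ordinary function factory, which some readers find easier to follow than with(list(...), ...). This is only a sketch: makeAuthorMerger and the two small data frames are invented for illustration and are not part of the ?merge example.

```r
# Hypothetical helper: returns a function that remembers `authors`.
makeAuthorMerger <- function(authors) {
  force(authors)  # evaluate the argument when the factory is called
  function(books) merge(authors, books, by.x = "surname", by.y = "name")
}

# Small made-up inputs, just to exercise the closure.
authors_small <- data.frame(surname = c("Tukey", "Ripley"),
                            nationality = c("US", "UK"),
                            stringsAsFactors = FALSE)
books_small <- data.frame(name = c("Ripley", "Tukey", "R Core"),
                          title = c("Spatial Statistics",
                                    "Exploratory Data Analysis",
                                    "An Introduction to R"),
                          stringsAsFactors = FALSE)

mergeSmall <- makeAuthorMerger(authors_small)
merged <- mergeSmall(books_small)  # "R Core" has no matching author row
```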

Edit 2

To read a file into an object that will be lexically bound, you can do

fn <- local({
  data <- read.csv("filename.csv")
  function(...) {
    ...
  }
})

fn <- with(list(data=read.csv("filename.csv")),
     function(...) {
       ...
     }
   )

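## note: local() returns the data frame itself here, so with() exposes
## its columns by name rather than an object called `data`
fn <- with(local(data <- read.csv("filename.csv")),
     function(...) {
       ...
     }
   )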

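And so on. (I'm assuming function(...) will have something to do with your "merge_map".) You can also use evalq instead of local. To "bring in" objects that reside in the global space (or an enclosing environment), you can do the following: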

globalobj <- value      ## could be from read.csv()
fn <- local({
  localobj <- globalobj ## if globalobj is not locally defined, 
                        ## R will look in enclosing environment
                        ## in this case, the globalenv()
  function(...) {
    ...
  }
})

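Modifying globalobj later will then not change the localobj attached to the function (because almost (?) everything in R follows pass-by-value semantics). You could also use with instead of local, as in the examples above.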
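A small check of that behaviour, using a made-up named vector in place of real data; fn_live is a hypothetical contrast case that refers to the global object directly instead of copying it.

```r
globalobj <- c(a = 1, b = 2)

fn <- local({
  localobj <- globalobj   # value is captured in the closure's environment here
  function() localobj
})

globalobj["a"] <- 100     # change the global object afterwards
captured <- fn()          # still the values captured earlier

# By contrast, a function that refers to globalobj directly (no local copy)
# looks it up at call time and so sees the change.
fn_live <- function() globalobj
live <- fn_live()
```

Here captured keeps a = 1 while live reflects the updated a = 100.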