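For loop and/or lapply in a function when creating new variables

Asked: 2017-10-10 11:18:21

Tags: r for-loop lapply

I have put a set of lapply statements (which extract postcodes from 5 large text fields) in a function: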

opm_naar_postc=function(kolom1,kolom2,kolom3,kolom4,kolom5) {
    postc=lapply(kolom1, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc1=lapply(kolom1, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc2=lapply(kolom2, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc3=lapply(kolom2, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc4=lapply(kolom3, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc5=lapply(kolom3, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc6=lapply(kolom4, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc7=lapply(kolom4, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc8=lapply(kolom5, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc9=lapply(kolom5, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])

Then I want to remove any spaces, dots, NAs, etc. from postc through postc9:

postcodes=c("postc","postc1","postc2","postc3","postc4","postc5","postc6","postc7","postc8","postc9")
for (i in postcodes) {
  i=gsub(" ","",i)
  i=gsub("NA|[[:punct:]]","",i)  }

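Finally, I paste all the postc variables into postc9, so that a single variable remains. This variable is what the function returns. I then call the function like this: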

df = df %>% mutate(postcode=opm_naar_postc(var1,var2,var3,var4,var5)) 

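First, the for loop does not work (there is no error, but it does not do anything). When I don't use the for loop, it does work. Second, I would like to put all 10 lapply rules into a single for loop. Is that possible? I have tried many things, but nothing seems to work...

Can anyone help me?

Thanks!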

An example of my data frame df:

var1            var2          var3            var4           var5
blablaehdhde    blablatext    blabla 1983 rf  blablatext     blablatext
1982 rf blabla  text blala    blablbal        blaakakk text  hahahahah
blblatext       textte8743GH  sdkhflksfjf     kjsnhblabla    gagagagag

Expected result:

postcode
1983rf
1982rf
8743GH

2 answers:

Answer 0: (score: 1)

Here is an idea using a regular expression:

gsub('^\\D*?(\\d+)\\s?(\\D{2}).*$', '\\1\\2', grep('\\d+', unlist(df), value = TRUE))

#   var12    var23    var31 
#"1982rf" "8743GH" "1983rf" 
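
The one-liner above returns its matches in column order (var1, var2, var3), not in row order as in the expected result. A minimal sketch of a row-wise variant follows, assuming each row contains at most one postcode and that a postcode is four digits, an optional space, and two letters. The helper name extract_postcode and the " | " separator are hypothetical choices for illustration, not part of the original answer. The optional space in the pattern covers both the "with space" and "without space" rules from the question, so a single pattern may be enough in place of the ten lapply calls.

```r
# Example data from the question
df <- data.frame(
  var1 = c("blablaehdhde", "1982 rf blabla", "blblatext"),
  var2 = c("blablatext", "text blala", "textte8743GH"),
  var3 = c("blabla 1983 rf", "blablbal", "sdkhflksfjf"),
  var4 = c("blablatext", "blaakakk text", "kjsnhblabla"),
  var5 = c("blablatext", "hahahahah", "gagagagag"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first match of "4 digits, optional space, 2 letters"
# per element; NA where nothing matches; spaces removed from the match
extract_postcode <- function(x) {
  hit <- regexpr("[0-9]{4} ?[A-Za-z]{2}", x)
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- regmatches(x, hit)  # regmatches() returns only the matches
  gsub(" ", "", out)
}

# Combine the five text columns per row; the " | " separator keeps a match
# from spanning two columns
row_text <- do.call(paste, c(df, sep = " | "))

df$postcode <- extract_postcode(row_text)
df$postcode
```

Because extract_postcode is vectorised, it could also be called inside mutate() on a pasted column, if the dplyr style from the question is preferred.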

Answer 1: (score: 0)

You can try:

# your data
df <- structure(c("blablaehdhde", "1982 rf blabla", "blblatext", "blablatext", 
"text blala", "textte8743GH", "blabla 1983 rf", "blablbal", "sdkhflksfjf", 
"blablatext", "blaakakk text", "kjsnhblabla", "blablatext", "hahahahah", 
"gagagagag"), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("var1", 
"var2", "var3", "var4", "var5")))


# pipeline
library(tidyverse)
library(stringi)
as_tibble(df) %>% 
          gather() %>%                                  # long format: one row per cell (key, value)
          mutate(value=gsub(" ", "", value)) %>%        # drop spaces so "1982 rf" becomes "1982rf"
          mutate(postcode=stri_extract_first_regex(value, "[0-9]+(.{2})")) %>% 
          filter(!is.na(postcode)) 
# A tibble: 3 x 3
    key        value postcode
  <chr>        <chr>    <chr>
1  var1 1982rfblabla   1982rf
2  var2 textte8743GH   8743GH
3  var3 blabla1983rf   1983rf
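
As for the for loop in the question: for (i in postcodes) iterates over the character strings "postc", "postc1", and so on, so gsub is applied to those names and the result is stored in i, which is overwritten on the next iteration. The lists themselves are never modified, which is why nothing appears to happen. A minimal sketch of one way around this, assuming the intermediate results are kept in a named list instead of ten separate variables (the values in results are made-up stand-ins, not output from the original function):

```r
# Toy stand-in for the postc ... postc9 results (hypothetical values)
results <- list(
  postc  = c(" 1983 rf.", NA),
  postc1 = c(NA, "8743GH,")
)

# Looping over the names only changes the loop variable `i`;
# `results` is left untouched
for (i in names(results)) {
  i <- gsub("[[:punct:]]", "", i)
}

# Looping over the values and keeping the return value does the cleaning
cleaned <- lapply(results, function(v) {
  v <- gsub("[[:space:]]|[[:punct:]]", "", v)  # strip spaces and punctuation
  v[is.na(v)] <- ""                            # turn NA into an empty string
  v
})

# Paste the cleaned pieces element-wise into a single vector
postcode <- do.call(paste0, unname(cleaned))
postcode
```

Replacing real NA values with "" before pasting avoids matching the literal text "NA" with a regular expression, which could also remove those two letters from a genuine postcode.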