I have put a series of lapply statements (extracting postcodes from 5 large text fields) in a function:
opm_naar_postc=function(kolom1,kolom2,kolom3,kolom4,kolom5) {
postc=lapply(kolom1, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc1=lapply(kolom1, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc2=lapply(kolom2, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc3=lapply(kolom2, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc4=lapply(kolom3, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc5=lapply(kolom3, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc6=lapply(kolom4, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc7=lapply(kolom4, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc8=lapply(kolom5, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
postc9=lapply(kolom5, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
Then I want to remove any spaces, dots, NAs, etc. from postc through postc9:
postcodes=c("postc","postc1","postc2","postc3","postc4","postc5","postc6","postc7","postc8","postc9")
for (i in postcodes) {
i=gsub(" ","",i)
i=gsub("NA|[[:punct:]]","",i) }
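Finally, I paste all the postc variables into postc9, so that a single variable remains. That variable is what the function returns. I call the function like this: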
df = df %>% mutate(postcode=opm_naar_postc(var1,var2,var3,var4,var5))
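First, the for loop does not work (there is no error, but it does not do anything). When I do not use a for loop, it does work. Second, I would like to put all 10 lapply rules in one for loop; is that possible? I have tried many things, but nothing seems to work...
Can anyone help me?
Thanks!
An example of my data frame df: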
var1            var2          var3            var4           var5
blablaehdhde    blablatext    blabla 1983 rf  blablatext     blablatext
1982 rf blabla  text blala    blablbal        blaakakk text  hahahahah
blblatext       textte8743GH  sdkhflksfjf     kjsnhblabla    gagagagag
Expected result:
postcode
1983rf
1982rf
8743GH
Answer 0 (score: 1)
Here is an idea using regular expressions:
gsub('^\\D*?(\\d+)\\s?(\\D{2}).*$', '\\1\\2', grep('\\d+', unlist(df), value = TRUE))
#    var12    var23    var31
# "1982rf" "8743GH" "1983rf"
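As for why the original `for` loop appeared to do nothing: `for (i in postcodes)` walks over the character strings `"postc"`, `"postc1"`, and so on, so each `gsub()` call only rewrites the loop variable `i` and never touches the objects with those names. The sketch below illustrates this and shows one possible alternative, which is to keep the vectors in a list and clean them with a single `lapply()`. The toy values are made up for illustration, and `cleaned` is a hypothetical name that does not appear in the question.

```r
# Toy stand-ins for the postc, postc1, ... vectors (values are made up).
postc  <- c(" 1983 rf.", NA)
postc1 <- c(NA, "1982rf ")

# Looping over the *names* only ever changes the loop variable i.
for (i in c("postc", "postc1")) {
  i <- gsub(" ", "", i)
}
postc  # unchanged: the loop never assigned to postc itself

# Keeping the vectors in a list lets one lapply() clean all of them.
# gsub() leaves NA values as NA, so no "NA" strings need to be removed.
cleaned <- lapply(list(postc = postc, postc1 = postc1), function(x) {
  gsub("[[:space:][:punct:]]", "", x)
})
cleaned$postc
```

The same list-based pattern would also cover the second question: the ten near-identical `lapply()` lines can become one loop over a list of columns.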
Answer 1 (score: 0)
You could try:
# your data
df <- structure(c("blablaehdhde", "1982 rf blabla", "blblatext", "blablatext",
"text blala", "textte8743GH", "blabla 1983 rf", "blablbal", "sdkhflksfjf",
"blablatext", "blaakakk text", "kjsnhblabla", "blablatext", "hahahahah",
"gagagagag"), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("var1",
"var2", "var3", "var4", "var5")))
# pipeline
library(tidyverse)
library(stringi)
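as_tibble(df) %>%                       # as.tibble() is deprecated
  gather() %>%                          # long format: one row per cell
  mutate(value = gsub(" ", "", value)) %>%
  # first run of digits plus the two characters that follow it
  mutate(postcode = stri_extract_first_regex(value, "[0-9]+(.{2})")) %>%
  filter(!is.na(postcode))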
# A tibble: 3 x 3
  key   value        postcode
  <chr> <chr>        <chr>
1 var1  1982rfblabla 1982rf
2 var2  textte8743GH 8743GH
3 var3  blabla1983rf 1983rf
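The pipeline above returns the matches in long format, ordered by column, rather than as one value per row of `df`, which is what the `mutate()` call in the question expects. A minimal base R sketch of a row-aligned version follows. The function name `extract_postcode` and the pattern are assumptions, not part of the original post: the pattern takes a postcode to be four digits, an optional space, and two letters, and the first column that contains a match wins.

```r
# Hypothetical helper (name and pattern are assumptions for illustration).
extract_postcode <- function(...) {
  cols <- list(...)
  # Four digits, an optional space, two letters: covers "1983 rf" and "8743GH".
  pattern <- "[0-9]{4} ?[A-Za-z]{2}"
  result <- rep(NA_character_, length(cols[[1]]))
  for (col in cols) {
    col <- as.character(col)
    m <- regexpr(pattern, col)
    found <- !is.na(m) & m != -1
    hit <- rep(NA_character_, length(col))
    hit[found] <- regmatches(col, m)   # one match per element that has one
    need <- is.na(result)              # rows that have no postcode yet
    result[need] <- hit[need]
  }
  gsub(" ", "", result)                # drop the optional space
}

# The example data from the question.
df <- data.frame(
  var1 = c("blablaehdhde", "1982 rf blabla", "blblatext"),
  var2 = c("blablatext", "text blala", "textte8743GH"),
  var3 = c("blabla 1983 rf", "blablbal", "sdkhflksfjf"),
  var4 = c("blablatext", "blaakakk text", "kjsnhblabla"),
  var5 = c("blablatext", "hahahahah", "gagagagag"),
  stringsAsFactors = FALSE
)
df$postcode <- extract_postcode(df$var1, df$var2, df$var3, df$var4, df$var5)
df$postcode
# "1983rf" "1982rf" "8743GH"
```

Because the helper is vectorised over rows, it should also work inside the pipeline from the question, for example `df %>% mutate(postcode = extract_postcode(var1, var2, var3, var4, var5))`.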