I have a dataframe of the form:
B <- data.frame(B=c(rep(" 'abcefgh.abc_123.1_123.1'",length=50),
rep(" 'ab[+12.1]abcdefgh.abc_123.1_123.1'",length=50)))
I need to split this single column into 4 columns. My first attempt was to just use a for loop and the strsplit() command to cut up each observation and paste it back together in the desired format.
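Bsplit <- data.frame()
for (i in 1:nrow(B)) {
  # Split on '_' first, then split the leading piece on '.'
  temp3 <- strsplit(as.character(B$B[i]), split = '_', fixed = TRUE)
  temp4 <- strsplit(temp3[[1]][1], split = '.', fixed = TRUE)
  if (is.na(temp4[[1]][3])) {
    # No '.' inside brackets: the leading piece has exactly two parts
    bsplit <- data.frame(a = temp4[[1]][1], b = temp4[[1]][2],
                         c = temp3[[1]][2], d = temp3[[1]][3])
    Bsplit <- rbind(Bsplit, bsplit)
  } else {
    # A '.' inside brackets produced three parts: glue the first two back together
    bsplit <- data.frame(a = paste(temp4[[1]][1], '.', temp4[[1]][2], sep = ''),
                         b = temp4[[1]][3],
                         c = temp3[[1]][2], d = temp3[[1]][3])
    Bsplit <- rbind(Bsplit, bsplit)
  }
}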
This gives the desired result, but it is far too slow to be practical. On my second attempt I used a combination of cSplit_f() and stri_split_fixed().
library(stringi)
library(splitstackshape)
X <- cSplit_f(B,1,sep='_')
Y <- lapply(data.frame(X[[1]]),stri_split_fixed,pattern='.',simplify= TRUE)
The problem is that when a string takes the form 'ab[+12.1]abcdefgh.abc_123.1_123.1', R cuts the string like this: 'ab[+12' | '1]abcdefgh' | 'abc' | 123.1 | 123.1. How do I protect the string so that the split ignores the '.' inside the brackets and returns 'ab[+12.1]abcdefgh' | 'abc' | 123.1 | 123.1?
Answer 0 (score: 2)
Truly, there is little that increasingly more complex regular expressions cannot accomplish.
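This approach is a little risky. It:
1. replaces each period inside square brackets with a sentinel character (|),
2. splits the string on the remaining periods (.), and
3. replaces the sentinel with a period (.) again.
The appropriate choice of a sentinel character is important, as is the assumption that all of your nuisance characters are contained in symmetric square brackets.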
library(tidyverse)
B <- data.frame(B=c(rep(" 'abcefgh.abc_123.1_123.1'",length=50),
rep(" 'ab[+12.1]abcdefgh.abc_123.1_123.1'",length=50)))
B %>%
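# Swap the period inside [...] for the sentinel "|".
# The * sits inside each group so the whole run is captured, not just its last character.
mutate(B = gsub("(?<=\\[)([^\\.]*)\\.([^\\.]*)(?=\\])", "\\1|\\2", B, perl = TRUE)) %>%
# Split on the remaining periods
separate(B, into = c("a", "b", "c", "d"), sep = "\\.", extra = "merge") %>%
# Put the protected period back
mutate(across(everything(), ~ gsub("|", ".", .x, fixed = TRUE))) %>%
tail
#> a b c d
#> 95 'ab[+12.1]abcdefgh abc_123 1_123 1'
#> 96 'ab[+12.1]abcdefgh abc_123 1_123 1'
#> 97 'ab[+12.1]abcdefgh abc_123 1_123 1'
#> 98 'ab[+12.1]abcdefgh abc_123 1_123 1'
#> 99 'ab[+12.1]abcdefgh abc_123 1_123 1'
#> 100 'ab[+12.1]abcdefgh abc_123 1_123 1'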
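The same sentinel idea can also be sketched in base R so that it ends in the four columns the question asks for, by splitting on '_' first and only then splitting the leading piece on '.'. This is a sketch under assumptions, not a tested drop-in: the sentinel "|" is assumed never to occur in the real data, each bracket is assumed to contain at most one period, and the two-row vector x is made up for illustration.

```r
# Two illustrative strings, one of each shape from the question
x <- c("'abcefgh.abc_123.1_123.1'",
       "'ab[+12.1]abcdefgh.abc_123.1_123.1'")

sentinel <- "|"  # assumption: this character never appears in the data

# 1. Protect the period inside [...] by swapping it for the sentinel
protected <- gsub("(\\[[^\\].]*)\\.([^\\].]*\\])",
                  paste0("\\1", sentinel, "\\2"),
                  x, perl = TRUE)

# 2. Split on '_' (three pieces), then split only the first piece on '.'
parts <- strsplit(protected, "_", fixed = TRUE)
rows <- lapply(parts, function(p) {
  c(strsplit(p[1], ".", fixed = TRUE)[[1]], p[2], p[3])
})

# 3. Bind into a data frame and restore the protected period
out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(out) <- c("a", "b", "c", "d")
out[] <- lapply(out, function(col) gsub(sentinel, ".", col, fixed = TRUE))

print(out)
```

Because every step is vectorised or a single lapply() over the rows, this avoids growing a data frame with rbind() inside a loop, which is what made the first attempt in the question slow.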
Answer 1 (score: 2)
A base R attempt that makes use of regular expression grouping:
mydf <- data.frame(B=c(rep(" 'abcefgh.abc_123.1_123.1'",length=50),
rep(" 'ab[+12.1]abcdefgh.abc_123.1_123.1'",length=50)))
new_df <- do.call(rbind, strsplit(gsub("(['\\w\\+\\.\\[]*)(\\]*)([a-z]+)(\\.)([\\w\\.']+)",
                                       "\\1\\2\\3_\\5",
                                       trimws(mydf$B),
                                       perl = TRUE), split = "_"))
new_df <- data.frame(new_df)
# Just a select number of rows
X1 X2 X3 X4
'abcefgh abc 123.1 123.1'
'abcefgh abc 123.1 123.1'
'abcefgh abc 123.1 123.1'
'abcefgh abc 123.1 123.1'
'abcefgh abc 123.1 123.1'
'abcefgh abc 123.1 123.1'
'ab[+12.1]abcdefgh abc 123.1 123.1'
'ab[+12.1]abcdefgh abc 123.1 123.1'
'ab[+12.1]abcdefgh abc 123.1 123.1'
'ab[+12.1]abcdefgh abc 123.1 123.1'
'ab[+12.1]abcdefgh abc 123.1 123.1'
'ab[+12.1]abcdefgh abc 123.1 123.1'
The idea here is to group each row into 5 chunks and use gsub to target the chunks that would constitute your new columns. I will use 'ab[+12.1]abcdefgh.abc_123.1_123.1' as an example. Here, you want to group the string into the following chunks: 'ab[+12.1 , ] , abcdefgh , . and abc_123.1_123.1' , and then you can concatenate the groups back together except for the fourth group, which is replaced with _ . At this point you have all four columns you need, separated by _ . Subsequently, you can go right ahead and split your new row on _ to generate 4 different columns.
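To see the five chunks for the example string, the capture groups can be pulled out with regexec() and regmatches(). This is only an illustrative sketch: the variable names are made up here, and it inspects a single string rather than the whole column.

```r
pattern <- "(['\\w\\+\\.\\[]*)(\\]*)([a-z]+)(\\.)([\\w\\.']+)"
x <- "'ab[+12.1]abcdefgh.abc_123.1_123.1'"

# regmatches() returns the full match first, followed by the five groups
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
groups <- m[[1]][-1]
print(groups)

# Rebuild the string with group 4 (the separating period) replaced by "_",
# then split on "_" to get the four columns
joined <- paste0(groups[1], groups[2], groups[3], "_", groups[5])
cols <- strsplit(joined, "_", fixed = TRUE)[[1]]
print(cols)
```

The gsub() call in the answer does the same rebuild in one step through the replacement string "\\1\\2\\3_\\5".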
I hope this helps.