Splitting character vector into data frame when the separating character is in the string

Time: 2016-10-20 18:55:43

Tags: r dataframe

I have a dataframe of the form:

B <- data.frame(B=c(rep(" 'abcefgh.abc_123.1_123.1'",length=50),
                    rep(" 'ab[+12.1]abcdefgh.abc_123.1_123.1'",length=50)))

I need to split this single column into 4 columns. My first attempt used a for loop with strsplit() to cut up each observation and paste the pieces back together in the desired format.

Bsplit <- data.frame()
for (i in 1:nrow(B)){
  temp3 <- strsplit(as.character(B$B[i]),split='_', fixed= TRUE)
  temp4 <- strsplit(temp3[[1]][1],split='.',fixed= TRUE)
  if(is.na(temp4[[1]][3])){
    bsplit <- data.frame(a=temp4[[1]][1],b=temp4[[1]][2],c=temp3[[1]][2],d=temp3[[1]][3])
    Bsplit <- rbind(Bsplit,bsplit)
  }
  else {
    bsplit <- data.frame(a=paste(temp4[[1]][1],'.',temp4[[1]][2],sep=''),b=temp4[[1]][3],
              c=temp3[[1]][2],d=temp3[[1]][3])
    Bsplit <- rbind(Bsplit,bsplit)
  }
}

This gives the desired result, but it is far too slow to be practical. On my second attempt I used a combination of cSplit_f() and stri_split_fixed().

library(stringi)
library(splitstackshape)

X <- cSplit_f(B,1,sep='_')
Y <- lapply(data.frame(X[[1]]),stri_split_fixed,pattern='.',simplify= TRUE)

The problem is that when a string takes the form 'ab[+12.1]abcdefgh.abc_123.1_123.1', R cuts it like this: 'ab[+12' | 'abcdefgh' | 'abc' | 123.1 | 123.1. How do I protect the string so that the '.' inside the brackets is ignored as a separator and the result is 'ab[+12.1]abcdefgh' | 'abc' | 123.1 | 123.1?

2 answers:

Answer 0 (score: 2)

Truly, there is little that increasingly more complex regular expressions cannot accomplish.

This approach is a little risky. It:

  1. Identifies all nuisance characters within the square brackets.
  2. Replaces those with a sentinel character (I chose |).
  3. Splits the string on your separator.
  4. Modifies all columns to change the sentinel character back to a period (.).

The appropriate choice of a sentinel character is important, as is the assumption that all of your nuisance characters are contained in symmetric square brackets.

library(tidyverse)

B <- data.frame(B=c(rep(" 'abcefgh.abc_123.1_123.1'",length=50),
                    rep(" 'ab[+12.1]abcdefgh.abc_123.1_123.1'",length=50)))

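B %>%
  # Swap the period inside [...] for the sentinel; the * sits inside each
  # capture group so the whole run of characters is kept, not just the last one
  mutate(B = gsub("(?<=\\[)([^.\\]]*)\\.([^.\\]]*)(?=\\])", "\\1|\\2", B, perl = TRUE)) %>%
  # Split on the first remaining period, then split the rest on underscores
  separate(B, into = c("a", "b"), sep = "\\.", extra = "merge") %>%
  separate(b, into = c("b", "c", "d"), sep = "_") %>%
  # Change the sentinel back to a period in every column
  mutate(across(everything(), ~ gsub("|", ".", .x, fixed = TRUE))) %>%
  tail

#>                       a   b     c      d
#> 95   'ab[+12.1]abcdefgh abc 123.1 123.1'
#> 96   'ab[+12.1]abcdefgh abc 123.1 123.1'
#> 97   'ab[+12.1]abcdefgh abc 123.1 123.1'
#> 98   'ab[+12.1]abcdefgh abc 123.1 123.1'
#> 99   'ab[+12.1]abcdefgh abc 123.1 123.1'
#> 100  'ab[+12.1]abcdefgh abc 123.1 123.1'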
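Since the choice of sentinel matters, it may be worth checking that the character is actually unused before relying on it. Below is a minimal base R sketch of that check followed by the same protect, split, restore sequence. `pick_sentinel` and its candidate list are hypothetical names made up for this illustration, and the pattern assumes at most one period inside each pair of brackets.

```r
# Hypothetical helper (not from any package): return the first candidate
# sentinel that does not occur anywhere in the input strings.
pick_sentinel <- function(x, candidates = c("|", "~", "\u0001")) {
  for (s in candidates) {
    if (!any(grepl(s, x, fixed = TRUE))) return(s)
  }
  stop("no unused sentinel among the candidates")
}

x <- c(" 'abcefgh.abc_123.1_123.1'",
       " 'ab[+12.1]abcdefgh.abc_123.1_123.1'")
sentinel <- pick_sentinel(x)

# 1-2. Protect the period inside [...] (assumes one period per bracket pair).
protected <- gsub("(?<=\\[)([^.\\]]*)\\.([^.\\]]*)(?=\\])",
                  paste0("\\1", sentinel, "\\2"), x, perl = TRUE)

# 3. Turn the first remaining period into "_" and split on "_".
pieces <- strsplit(sub(".", "_", protected, fixed = TRUE), "_", fixed = TRUE)
out <- do.call(rbind, pieces)

# 4. Restore the period; out[] keeps the matrix dimensions.
out[] <- gsub(sentinel, ".", out, fixed = TRUE)
print(out)
```

If none of the candidates is free, the helper stops with an error rather than silently corrupting the data, which seems the safer default here.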

Answer 1 (score: 2)

A base R attempt which makes use of regular expression grouping:

Data:

mydf <- data.frame(B=c(rep(" 'abcefgh.abc_123.1_123.1'",length=50),
                rep(" 'ab[+12.1]abcdefgh.abc_123.1_123.1'",length=50)))

Code:

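# Rewrite the first separating period as "_" (groups 1-3 and 5 are kept,
# group 4 is the period being replaced), then split every row on "_".
new_df <- do.call(rbind, strsplit(gsub("(['\\w\\+\\.\\[]*)(\\]*)([a-z]+)(\\.)([\\w\\.']+)",
                             "\\1\\2\\3_\\5",
                             trimws(mydf$B),
                             perl = TRUE), split = "_"))
new_df <- data.frame(new_df)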

Output:

# Just a select number of rows
 X1                 X2  X3    X4    
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'

Explanation:

The idea here is to group each row into 5 chunks and use gsub to target the chunks that will make up your new columns. I will use 'ab[+12.1]abcdefgh.abc_123.1_123.1' as an example. Here, you want to group the string into the following chunks: 'ab[+12.1, ], abcdefgh, . and abc_123.1_123.1'. You then concatenate the groups back together, except for the fourth group, which is replaced with _. At this point all four columns you need are separated by _, so you can split each rewritten row on _ to generate the 4 columns.
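To see the five chunks for yourself, here is a small sketch that pulls out the capture groups for the example string with regexec() and regmatches(), and then applies the same replacement and split. The variable names are only for illustration.

```r
pattern <- "(['\\w\\+\\.\\[]*)(\\]*)([a-z]+)(\\.)([\\w\\.']+)"
x <- "'ab[+12.1]abcdefgh.abc_123.1_123.1'"

# Element 1 is the full match; elements 2 to 6 are the five capture groups.
groups <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][-1]
print(groups)

# Glue groups 1-3 and 5 back together with "_" in place of group 4,
# then split on "_" to get the four column values.
rebuilt <- gsub(pattern, "\\1\\2\\3_\\5", x, perl = TRUE)
columns <- strsplit(rebuilt, "_", fixed = TRUE)[[1]]
print(columns)
```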

I hope this helps.