Cycle through columns to find duplicate/character combinations

Time: 2015-09-14 15:52:48

Tags: r dataframe

Suppose that I have the following data set:

  person       location                     job
1    Joe             TX           Welder|Welder
2    Bob          TX|TX                    Chef
3  Billy       OK|OK|OK         Teacher|Teacher
4 Denise             MN   Unemployed|Unemployed
5  Sasha KS|KS|KS|KS|KS Groomer|Groomer|Groomer

Notice that, for some people, the location and job values are duplicated, with the repeated copies separated by a '|' character.

I'd like to cycle through all of the columns (except for the first one), identify where there's a '|' + duplication, and end up with the following table:

  person location        job
1    Joe       TX     Welder
2    Bob       TX       Chef
3  Billy       OK    Teacher
4 Denise       MN Unemployed
5  Sasha       KS    Groomer  

Thanks!

2 Answers:

Answer 0: (score: 4)

We can use sub. We match a literal | followed by zero or more characters (.*) up to the end ($) of the string and replace the match with ''. Here 'm1' is the matrix shown in the data section.

sub('\\|.*$', '', m1)
#      person   location job         
#[1,] "Joe"    "TX"     "Welder"    
#[2,] "Bob"    "TX"     "Chef"      
#[3,] "Billy"  "OK"     "Teacher"   
#[4,] "Denise" "MN"     "Unemployed"
#[5,] "Sasha"  "KS"     "Groomer"   
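
As a quick check of what the pattern does, here is a minimal sketch on a plain character vector. The vector x is made up for illustration and is not part of the question's data. Strings without a '|' are left untouched; everything from the first '|' onward is removed.

```r
# Hypothetical example vector, mixing values with and without '|'
x <- c("TX|TX", "MN", "KS|KS|KS", "Welder|Welder")

# '\\|' matches a literal pipe; '.*$' matches everything after it to the end
res <- sub("\\|.*$", "", x)
print(res)
```

Because sub replaces only the first match and '.*' is greedy, one call per string is enough; gsub is not needed here.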

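EDIT: The OP changed the matrix to a data.frame. In that case, we can use mutate_each from dplyr to apply sub to each of the columns. (In current dplyr releases, mutate_each and funs are deprecated in favor of mutate with across.)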

library(dplyr)
d1 %>%
     mutate_each(funs(sub('\\|.*$', '', .)))
#  person location        job
#1    Joe       TX     Welder
#2    Bob       TX       Chef
#3  Billy       OK    Teacher
#4 Denise       MN Unemployed
#5  Sasha       KS    Groomer

Or we loop through the columns of 'd1' with lapply(..), apply sub to each one, and assign the output back to the original dataset to replace the values. Assigning to d1[] keeps the data.frame structure.

d1[] <- lapply(d1, sub, pattern='\\|.*$', replacement='')
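
The question asks to leave the first column alone. A minimal sketch of that variant follows, assuming the same example values as in the question; the data.frame is rebuilt here (with stringsAsFactors = FALSE, which is my choice, not the answer's) so the snippet runs on its own.

```r
# Rebuild the example data from the question
d1 <- data.frame(
  person   = c("Joe", "Bob", "Billy", "Denise", "Sasha"),
  location = c("TX", "TX|TX", "OK|OK|OK", "MN", "KS|KS|KS|KS|KS"),
  job      = c("Welder|Welder", "Chef", "Teacher|Teacher",
               "Unemployed|Unemployed", "Groomer|Groomer|Groomer"),
  stringsAsFactors = FALSE
)

# d1[-1] selects every column except the first; only those are rewritten
d1[-1] <- lapply(d1[-1], sub, pattern = "\\|.*$", replacement = "")
print(d1)
```

The person column is never passed to sub, so a name that happened to contain a '|' would be preserved.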

data

m1 <- structure(c("Joe", "Bob", "Billy", "Denise", "Sasha", "TX", 
"TX|TX", 
"OK|OK|OK", "MN", "KS|KS|KS|KS|KS", "Welder|Welder", "Chef", 
"Teacher|Teacher", "Unemployed|Unemployed", "Groomer|Groomer|Groomer"
), .Dim = c(5L, 3L), .Dimnames = list(NULL, c("person", "location",
"job")))

d1 <- as.data.frame(m1)

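Answer 1: (score: 2)

You can also use cSplit from splitstackshape, which splits each listed column on '|' into numbered columns (location_1, location_2, ...), and then keep only the first piece of each. Here 'df' is the question's data frame:

library(splitstackshape)
cSplit(df, c('location','job'), sep='|')[, c('person','location_1','job_1'), with=FALSE]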
#   person location_1      job_1
#1:    Joe         TX     Welder
#2:    Bob         TX       Chef
#3:  Billy         OK    Teacher
#4: Denise         MN Unemployed
#5:  Sasha         KS    Groomer
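
Both answers keep the first piece and discard the rest, whether or not the later pieces really repeat it. If a cell could hold distinct values (say 'TX|OK|TX', which does not occur in the question's data), a base R sketch that drops only true repeats might look like this. The helper name collapse_dups is hypothetical.

```r
# Hypothetical helper: split each string on '|', keep unique pieces, re-join
collapse_dups <- function(x) {
  pieces <- strsplit(as.character(x), "|", fixed = TRUE)
  vapply(pieces, function(p) {
    if (anyNA(p)) return(NA_character_)  # leave missing values as NA
    paste(unique(p), collapse = "|")
  }, character(1))
}

out <- collapse_dups(c("KS|KS|KS", "Chef", "TX|OK|TX", NA))
print(out)
```

It can be applied to a data.frame in the same way as sub, for example d1[-1] <- lapply(d1[-1], collapse_dups).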