Suppose that I have the following data set:
person location job
1 Joe TX Welder|Welder
2 Bob TX|TX Chef
3 Billy OK|OK|OK Teacher|Teacher
4 Denise MN Unemployed|Unemployed
5 Sasha KS|KS|KS|KS|KS Groomer|Groomer|Groomer
Notice that, for some people, there is some duplication for location and job. The duplication is preceded by a '|' character.
I'd like to cycle through all of the columns (except for the first one), identify where there's a '|' + duplication, and end up with the following table:
person location job
1 Joe TX Welder
2 Bob TX Chef
3 Billy OK Teacher
4 Denise MN Unemployed
5 Sasha KS Groomer
Thanks!
Answer 0 (score: 4)
We can use sub. We match a literal | (escaped as \\|) followed by zero or more characters (.*) up to the end ($) of the string and replace the match with ''.
sub('\\|.*$', '', m1)
# person location job
#[1,] "Joe" "TX" "Welder"
#[2,] "Bob" "TX" "Chef"
#[3,] "Billy" "OK" "Teacher"
#[4,] "Denise" "MN" "Unemployed"
#[5,] "Sasha" "KS" "Groomer"
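To see what the pattern does on its own, here is a minimal sketch on a plain character vector. The vector x is made up for illustration and is not part of the original data. Because | is the alternation operator in a regular expression, it has to be escaped as \\| to match a literal pipe.

```r
# Hypothetical example vector, not taken from the question's data
x <- c("TX", "TX|TX", "KS|KS|KS|KS|KS")

# "\\|" matches a literal pipe, ".*$" matches everything after it
first_only <- sub("\\|.*$", "", x)
print(first_only)

# Strings without a pipe are returned unchanged
print(identical(sub("\\|.*$", "", "MN"), "MN"))
```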
EDIT: The OP changed the matrix to a data.frame. In that case, we can use mutate_each from dplyr to apply sub to each of the columns:
library(dplyr)
d1 %>%
mutate_each(funs(sub('\\|.*$', '', .)))
# person location job
#1 Joe TX Welder
#2 Bob TX Chef
#3 Billy OK Teacher
#4 Denise MN Unemployed
#5 Sasha KS Groomer
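mutate_each() and funs() were deprecated in later dplyr releases. A roughly equivalent sketch, assuming dplyr 1.0.0 or later is installed, uses across() instead. The small data frame is rebuilt here only so the snippet runs by itself; d2 is a name chosen for illustration.

```r
library(dplyr)

# Reduced copy of the question's data, rebuilt for illustration
d1 <- data.frame(
  person   = c("Joe", "Bob", "Billy"),
  location = c("TX", "TX|TX", "OK|OK|OK"),
  job      = c("Welder|Welder", "Chef", "Teacher|Teacher"),
  stringsAsFactors = FALSE
)

# Apply the same sub() call to every column
d2 <- d1 %>%
  mutate(across(everything(), ~ sub("\\|.*$", "", .x)))
print(d2)
```

To leave the first column untouched, everything() could be replaced with -person.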
Or we loop through the columns of 'd1' with lapply, apply sub to each one, and assign the output back to the original dataset to replace the values. Assigning to d1[] keeps the data.frame structure.
d1[] <- lapply(d1, sub, pattern='\\|.*$', replacement='')
Data used in the examples above:
m1 <- structure(c("Joe", "Bob", "Billy", "Denise", "Sasha", "TX",
"TX|TX",
"OK|OK|OK", "MN", "KS|KS|KS|KS|KS", "Welder|Welder", "Chef",
"Teacher|Teacher", "Unemployed|Unemployed", "Groomer|Groomer|Groomer"
), .Dim = c(5L, 3L), .Dimnames = list(NULL, c("person", "location",
"job")))
d1 <- as.data.frame(m1)
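One caveat with the sub() pattern is that it drops everything after the first '|', whether or not the remainder is really a duplicate. If a cell could hold distinct values (for example 'TX|OK|TX', which does not occur in the data above and is assumed here only for illustration), a sketch that removes only the repeated entries could split on the pipe and keep the unique pieces. The helper name dedupe_pipe is hypothetical.

```r
# Hypothetical helper: collapse repeated pipe-separated entries
dedupe_pipe <- function(x) {
  # fixed = TRUE treats "|" as a literal character, so no escaping is needed
  parts <- strsplit(as.character(x), "|", fixed = TRUE)
  vapply(parts, function(p) paste(unique(p), collapse = "|"), character(1))
}

d1 <- data.frame(
  person   = c("Joe", "Bob", "Ann"),
  location = c("TX", "TX|TX", "TX|OK|TX"),  # the "Ann" row is made up
  stringsAsFactors = FALSE
)

# Same assign-back idiom as above, with the helper in place of sub()
d1[] <- lapply(d1, dedupe_pipe)
print(d1)
```

Missing values would need separate handling, since paste() turns NA into the string "NA".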
Answer 1 (score: 2)
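You can also use cSplit from splitstackshape, which splits each listed column on '|' into numbered columns. Keeping only the _1 columns gives the first value of each:
library(splitstackshape)
# df stands for the question's data frame (d1 in the answer above)
cSplit(df, c('location','job'), sep='|')[, c('person','location_1','job_1'), with=FALSE]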
# person location_1 job_1
#1: Joe TX Welder
#2: Bob TX Chef
#3: Billy OK Teacher
#4: Denise MN Unemployed
#5: Sasha KS Groomer
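For comparison, the same split-and-keep-first idea can be sketched in base R without extra packages. The data frame below is a reduced copy built for illustration, and first_piece is a hypothetical helper name.

```r
# Hypothetical helper: keep only the first pipe-separated entry
first_piece <- function(x) {
  # Split on a literal pipe and take the first element of each result
  vapply(strsplit(as.character(x), "|", fixed = TRUE), `[`, character(1), 1)
}

df <- data.frame(
  person   = c("Joe", "Bob", "Sasha"),
  location = c("TX", "TX|TX", "KS|KS|KS|KS|KS"),
  job      = c("Welder|Welder", "Chef", "Groomer|Groomer|Groomer"),
  stringsAsFactors = FALSE
)

# Only the listed columns are modified; person is left as is
cols <- c("location", "job")
df[cols] <- lapply(df[cols], first_piece)
print(df)
```

Unlike the cSplit version, this keeps the original column names rather than producing location_1 and job_1.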