Label strings as the same if the first part of the string is the same

Time: 2015-05-15 07:51:55

Tags: r data-manipulation

Some example data:

    id_trial        
 001_a.txt          
 001_a_t2.txt       
 949482_b.txt       
 949482_b_t2.txt    
 95_c.txt           
 95_c_t2.txt        

Note: the strings have different lengths, but within each pair the lengths are equal apart from the "_t2".

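How can I make it so that, if the part of the string before _t2 is the same, both rows get the same label in a new column? That is, I want something like this: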

    id_trial         subject
 001_a.txt           person_a
 001_a_t2.txt        person_a
 949482_b.txt        person_b
 949482_b_t2.txt     person_b
 95_c.txt            person_c
 95_c_t2.txt         person_c

Even this would work:

    id_trial         subject
 001_a.txt               a
 001_a_t2.txt            a
 949482_b.txt            b
 949482_b_t2.txt         b
 95_c.txt                c
 95_c_t2.txt             c

Any help is much appreciated.

1 answer:

Answer 0 (score: 1)

You could try sub() to extract the prefix part:

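# Inner sub() collapses a repeated run (e.g. "personn" -> "person");
# outer sub() keeps the text up to the first "_" plus one character after it.
df1$subject <- sub('([^_]+_.).*', '\\1',
                   sub('([^_]+)\\1+', '\\1', df1$id_trial))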
df1
#        id_trial  subject
#1   personn_a.txt person_a
#2 person_a_t2.txt person_a
#3    person_b.txt person_b
#4 person_b_t2.txt person_b
#5  personnn_c.txt person_c
#6 person_c_t2.txt person_c
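
The nested call can be easier to follow when split into two steps. The following is only a sketch of that breakdown: the intermediate names `step1` and `step2` are introduced here for illustration, and the input vector repeats the `id_trial` values of `df1` so the block runs on its own.

```r
# Same values as df1$id_trial (repeated here so the block is self-contained)
id_trial <- c("personn_a.txt", "person_a_t2.txt", "person_b.txt",
              "person_b_t2.txt", "personnn_c.txt", "person_c_t2.txt")

# Step 1: collapse the first run of an immediately repeated substring
# into a single copy, e.g. "personn" -> "person", "personnn" -> "person"
step1 <- sub('([^_]+)\\1+', '\\1', id_trial)

# Step 2: keep everything up to the first "_" plus the one character
# that follows it, and drop the rest of the string
step2 <- sub('([^_]+_.).*', '\\1', step1)
step2
```

Note that `sub()` replaces only the first match in each string, so step 1 repairs at most one repeated run per file name.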

If you need a numeric subject:

as.numeric(factor(df1$subject))
#[1] 1 1 2 2 3 3
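
One thing to keep in mind: `factor()` sorts its levels, so the codes follow the sorted order of the labels rather than the order in which they first appear. If numbering by first appearance is wanted, `match()` against `unique()` is one possible alternative. The vector `subject` below is a made-up example, not data from the question.

```r
# Hypothetical labels where the first label is not first in sorted order
subject <- c("person_b", "person_b", "person_a", "person_a")

# Codes follow the sorted factor levels: "person_a" = 1, "person_b" = 2
by_level <- as.numeric(factor(subject))

# Codes follow the order of first appearance: "person_b" = 1, "person_a" = 2
by_appearance <- match(subject, unique(subject))

by_level
by_appearance
```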

Update

For the second dataset:

df2$subject <- sub('\\d+_([a-z]+).*', '\\1', df2$id_trial)
df2
#         id_trial subject
#1       001_a.txt       a
#2    001_a_t2.txt       a
#3    949482_b.txt       b
#4 949482_b_t2.txt       b
#5        95_c.txt       c
#6     95_c_t2.txt       c
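
If the "person_a" style labels from the question are preferred, the captured letter can be prefixed with `paste0()`. As a further sketch, the part before "_t2" can also be kept whole as a grouping key. The `key` column is a name introduced here for illustration, and the pattern assumes every file name ends in ".txt" with an optional "_t2" directly before it.

```r
# Same values as df2 in the Data section (repeated so the block runs alone)
df2 <- data.frame(id_trial = c("001_a.txt", "001_a_t2.txt", "949482_b.txt",
                               "949482_b_t2.txt", "95_c.txt", "95_c_t2.txt"),
                  stringsAsFactors = FALSE)

# Prefix the captured letter(s) to get labels such as "person_a"
df2$subject <- paste0('person_', sub('\\d+_([a-z]+).*', '\\1', df2$id_trial))

# Assumption: names end in ".txt", optionally preceded by "_t2".
# Stripping both leaves the shared part of each pair, e.g. "001_a".
df2$key <- sub('(_t2)?\\.txt$', '', df2$id_trial)
df2
```

Rows that share the same `key` belong to the same pair, which may be useful when the letter alone is not unique across subjects.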

Data

df1 <-  structure(list(id_trial = c("personn_a.txt", "person_a_t2.txt", 
"person_b.txt", "person_b_t2.txt", "personnn_c.txt", "person_c_t2.txt"
)), .Names = "id_trial", class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(id_trial = c("001_a.txt", "001_a_t2.txt", 
"949482_b.txt", 
"949482_b_t2.txt", "95_c.txt", "95_c_t2.txt")), .Names = "id_trial", 
class = "data.frame", row.names = c(NA, -6L))