Some sample data:
id_trial
001_a.txt
001_a_t2.txt
949482_b.txt
949482_b_t2.txt
95_c.txt
95_c_t2.txt
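Note: the strings have different lengths, but within each pair the length is the same apart from the "_t2".
I'm stumped on how to make it so that if the part of the string before _t2 is the same, those two rows get the same label in a new column.
That is, I'd like something like this: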
id_trial subject
001_a.txt person_a
001_a_t2.txt person_a
949482_b.txt person_b
949482_b_t2.txt person_b
95_c.txt person_c
95_c_t2.txt person_c
Even this would work:
id_trial subject
001_a.txt a
001_a_t2.txt a
949482_b.txt b
949482_b_t2.txt b
95_c.txt c
95_c_t2.txt c
Any help is greatly appreciated.
Answer 0 (score: 1)
You can try sub to extract the prefix part:
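# inner sub(): collapse the first immediately repeated run (e.g. "personn" -> "person")
# outer sub(): keep the text up to the first "_" plus the one character after it
df1$subject <- sub('([^_]+_.).*', '\\1',
                   sub('([^_]+)\\1+', '\\1', df1$id_trial))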
df1
# id_trial subject
#1 personn_a.txt person_a
#2 person_a_t2.txt person_a
#3 person_b.txt person_b
#4 person_b_t2.txt person_b
#5 personnn_c.txt person_c
#6 person_c_t2.txt person_c
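The nested call can be easier to follow when the two sub() steps are run separately. This is only a sketch: `ids`, `step1` and `step2` are hypothetical names, and `ids` is a shortened stand-in for `df1$id_trial`.

```r
# Hypothetical vector mirroring a few values of df1$id_trial
ids <- c("personn_a.txt", "person_a_t2.txt", "personnn_c.txt")

# Step 1: collapse the first immediately repeated run of
# non-underscore characters ("nn" or "nnn" becomes "n")
step1 <- sub("([^_]+)\\1+", "\\1", ids)

# Step 2: keep everything up to the first underscore plus the
# single character that follows it, and drop the rest
step2 <- sub("([^_]+_.).*", "\\1", step1)

print(step1)
print(step2)
```

Because sub() replaces only the first match, step 1 assumes there is at most one repeated run per string, which holds for the sample data.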
If you need a numeric subject:
as.numeric(factor(df1$subject))
#[1] 1 1 2 2 3 3
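One thing to keep in mind: factor() sorts its levels by default, so the numeric codes follow sorted order rather than the order in which subjects first appear. If order of appearance matters, match() against unique() is one possible alternative. The sketch below uses a hypothetical `subject` vector in which the two orders differ.

```r
# Hypothetical subject labels, deliberately not in sorted order
subject <- c("person_b", "person_b", "person_a", "person_a")

# Codes based on sorted factor levels: person_a = 1, person_b = 2
by_level <- as.numeric(factor(subject))

# Codes based on first appearance: person_b = 1, person_a = 2
by_appearance <- match(subject, unique(subject))

print(by_level)
print(by_appearance)
```

For the sample data in this answer both approaches give the same result, since the subjects already appear in sorted order.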
For the second dataset:
df2$subject <- sub('\\d+_([a-z]+).*', '\\1', df2$id_trial)
df2
# id_trial subject
#1 001_a.txt a
#2 001_a_t2.txt a
#3 949482_b.txt b
#4 949482_b_t2.txt b
#5 95_c.txt c
#6 95_c_t2.txt c
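If the longer labels from the question's first desired output (person_a, person_b, ...) are wanted for this dataset, one way is to paste a prefix onto the extracted letter. This is a sketch only; `id_trial`, `letter` and `subject` are standalone vectors introduced for illustration rather than columns of `df2`.

```r
# Standalone copy of the id_trial values from the second dataset
id_trial <- c("001_a.txt", "001_a_t2.txt", "949482_b.txt",
              "949482_b_t2.txt", "95_c.txt", "95_c_t2.txt")

# Extract the lowercase letters that follow the leading digits and "_"
letter <- sub("\\d+_([a-z]+).*", "\\1", id_trial)

# Prepend "person_" to build the longer label
subject <- paste0("person_", letter)

print(data.frame(id_trial, subject))
```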
Data:
df1 <- structure(list(id_trial = c("personn_a.txt", "person_a_t2.txt",
"person_b.txt", "person_b_t2.txt", "personnn_c.txt", "person_c_t2.txt"
)), .Names = "id_trial", class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(id_trial = c("001_a.txt", "001_a_t2.txt",
"949482_b.txt", "949482_b_t2.txt", "95_c.txt", "95_c_t2.txt"
)), .Names = "id_trial", class = "data.frame", row.names = c(NA, -6L))