我的数据集包含这样的描述
PosID description
2 Ubiquitin carboxyl-terminal hydrolase 14 OS=Homo sapiens GN=USP14 PE=1 SV=3
2 26S proteasome non-ATPase regulatory subunit 1 OS=Homo sapiens GN=PSMD1 PE=1 SV=2
1 Ankycorbin OS=Homo sapiens GN=RAI14 PE=2 SV=1
2 Alstrom syndrome protein 1 OS=Homo sapiens GN=ALMS1 PE=2 SV=1
1 26S protease regulatory subunit 6A OS=Homo sapiens GN=PSMC3 PE=1 SV=3
1 sp PSMC3_Human 26S protease regulatory subunit 6A OS=Homo sapiens PE=1 SV=3
我想提取出现在GN =之后或在最后一种情况下的空格之后出现的特定单词
这是我要求的输出
PosID description
2 USP14
2 PSMD1
1 RAI14
2 ALMS1
1 PSMC3
1 PSMC3
数据
df = structure(list(PosID = c(2L, 2L, 1L, 2L, 1L, 1L), description = structure(c(6L,
2L, 4L, 3L, 1L, 5L), .Label = c("26S protease regulatory subunit 6A OS=Homo sapiens GN=PSMC3 PE=1 SV=3",
"26S proteasome non-ATPase regulatory subunit 1 OS=Homo sapiens GN=PSMD1 PE=1 SV=2",
"Alstrom syndrome protein 1 OS=Homo sapiens GN=ALMS1 PE=2 SV=1",
"Ankycorbin OS=Homo sapiens GN=RAI14 PE=2 SV=1", " sp PSMC3_Human 26S protease regulatory subunit 6A OS=Homo sapiens PE=1 SV=3",
"Ubiquitin carboxyl-terminal hydrolase 14 OS=Homo sapiens GN=USP14 PE=1 SV=3"
), class = "factor")), .Names = c("PosID", "description"), class = "data.frame", row.names = c(NA,
-6L))
答案 0 :(得分:1)
使用str_match
stringr
的选项
library(stringr)
out = Reduce('rbind',
lapply(c("GN=([A-Z0-9]+)\\s", "\\s([A-Z0-9]+)_"),
function(x) str_match(df$description, x)[,2])
)
df$required = out[!is.na(out)]
#df[,-2]
# PosID required
#1 2 USP14
#2 2 PSMD1
#3 1 RAI14
#4 2 ALMS1
#5 1 PSMC3
#6 1 PSMC3