Question

My dataset contains descriptions like this:

PosID   description
2        Ubiquitin carboxyl-terminal hydrolase 14 OS=Homo sapiens GN=USP14 PE=1 SV=3
2        26S proteasome non-ATPase regulatory subunit 1 OS=Homo sapiens GN=PSMD1 PE=1 SV=2
1        Ankycorbin OS=Homo sapiens GN=RAI14 PE=2 SV=1
2        Alstrom syndrome protein 1 OS=Homo sapiens GN=ALMS1 PE=2 SV=1
1        26S protease regulatory subunit 6A OS=Homo sapiens GN=PSMC3 PE=1 SV=3
1        sp PSMC3_Human 26S protease regulatory subunit 6A OS=Homo sapiens PE=1 SV=3

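I want to extract the specific word that comes after GN=, or, as in the last case, the word that comes after a space.

This is the output I need: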

PosID   description
2       USP14
2       PSMD1
1       RAI14
2       ALMS1
1       PSMC3
1       PSMC3

Data

df = structure(list(PosID = c(2L, 2L, 1L, 2L, 1L, 1L), description = structure(c(6L, 
2L, 4L, 3L, 1L, 5L), .Label = c("26S protease regulatory subunit 6A OS=Homo sapiens GN=PSMC3 PE=1 SV=3", 
"26S proteasome non-ATPase regulatory subunit 1 OS=Homo sapiens GN=PSMD1 PE=1 SV=2", 
"Alstrom syndrome protein 1 OS=Homo sapiens GN=ALMS1 PE=2 SV=1", 
"Ankycorbin OS=Homo sapiens GN=RAI14 PE=2 SV=1", "  sp PSMC3_Human 26S protease regulatory subunit 6A OS=Homo sapiens PE=1 SV=3", 
"Ubiquitin carboxyl-terminal hydrolase 14 OS=Homo sapiens GN=USP14 PE=1 SV=3"
), class = "factor")), .Names = c("PosID", "description"), class = "data.frame", row.names = c(NA, 
-6L))

Answer 1

An option using str_match from stringr:

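library(stringr)
# Try each pattern on every description; str_match()[,2] is the first
# capture group (NA where the pattern does not match)
out = Reduce('rbind',
             lapply(c("GN=([A-Z0-9]+)\\s", "\\s([A-Z0-9]+)_"), 
             function(x) str_match(df$description, x)[,2])
            )
# out has one row per pattern and one column per description; this step
# assumes each description matches exactly one of the two patterns
df$required = out[!is.na(out)]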

#df[,-2]
#  PosID required
#1     2    USP14
#2     2    PSMD1
#3     1    RAI14
#4     2    ALMS1
#5     1    PSMC3
#6     1    PSMC3
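
The answer above relies on every description matching exactly one of the two patterns. If a row matched neither or both, out[!is.na(out)] would no longer line up with the rows of df. A minimal base R sketch of a row-aligned variant follows, assuming the same two patterns. The helper name first_group, the stand-in vector desc, and its last entry are made up for illustration and are not part of the original question or answer.

```r
# Hypothetical helper (not from the original answer): returns the first
# capture group of `pattern` for each element of `x`, or NA if no match.
first_group <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Small stand-in for df$description; the last entry is a made-up row
# that matches neither pattern.
desc <- c(
  "Ubiquitin carboxyl-terminal hydrolase 14 OS=Homo sapiens GN=USP14 PE=1 SV=3",
  "  sp PSMC3_Human 26S protease regulatory subunit 6A OS=Homo sapiens PE=1 SV=3",
  "Unnamed protein OS=Homo sapiens PE=1 SV=1"
)

from_gn <- first_group(desc, "GN=([A-Z0-9]+)")   # word after GN=
from_id <- first_group(desc, "\\s([A-Z0-9]+)_")  # word before the underscore

# Prefer the GN= match; fall back to the ID before the underscore.
# Rows matching neither pattern stay NA, so the result keeps one value per row.
required <- ifelse(is.na(from_gn), from_id, from_gn)
```

With the data frame from the question, the same calls should work on as.character(df$description), since description is stored as a factor there.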
