Question

One of the columns in my data frame looks like the following. I need to get the output shown below.

Data :
NM_001104633|0|Sema3d|-
NM_0011042|0|XYZ|-
NM_0956|0|ghd|+

Required output :
Sema3d
XYZ
ghd

Answer 1

We can use read.table to split the strings into separate columns and then select only the column we are interested in.

read.table(text = df$V1, sep = "|")

#           V1 V2     V3 V4
#1 NM_001104633  0 Sema3d  -
#2   NM_0011042  0    XYZ  -
#3      NM_0956  0    ghd  +
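The call above only does the splitting; the column of interest still has to be picked out of the result. A minimal sketch of that last step, assuming the data frame is called df and holds the strings in a character column V1 (as in the call above). The data frame is rebuilt here only so the example runs on its own, and the name parts is just illustrative:

```r
# Assumed input: the question's strings in a character column named V1
df <- data.frame(
  V1 = c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+"),
  stringsAsFactors = FALSE
)

# Split each string on "|" into columns V1..V4
parts <- read.table(text = df$V1, sep = "|", stringsAsFactors = FALSE)

# Keep only the third field
parts$V3
#[1] "Sema3d" "XYZ"    "ghd"
```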

We can also use tidyr::separate for this:

tidyr::separate(df, V1, into = paste0("col1", 1:4), sep = "\\|")

Or cSplit from splitstackshape:

splitstackshape::cSplit(df, "V1", sep = "|")

Answer 2

x = c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+")
sub(".*0\\|(.*)\\|[+-]", "\\1", x)
#[1] "Sema3d" "XYZ"    "ghd"

#OR
sapply(strsplit(x, "\\|"), function(s) s[3])
#[1] "Sema3d" "XYZ"    "ghd"

#OR
sapply(x, function(s){
    inds = gregexpr("\\|", s)[[1]]
    substring(s, inds[2] + 1, inds[3] - 1)
},
USE.NAMES = FALSE)
#[1] "Sema3d" "XYZ"    "ghd"
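Since the question is about a data frame column rather than a free-standing vector, the strsplit approach can be written straight back into the data frame. This is a sketch under assumptions: the data frame is called df with the strings in a character column V1 (the names used in Answer 1), and the new column name gene is hypothetical.

```r
# Assumed input: the question's strings in a character column named V1
df <- data.frame(
  V1 = c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "|" as a literal character, so no escaping is needed
fields <- strsplit(df$V1, "|", fixed = TRUE)

# vapply enforces that each element yields exactly one string;
# rows with fewer than three fields would give NA
df$gene <- vapply(fields, function(s) s[3], character(1))
df$gene
#[1] "Sema3d" "XYZ"    "ghd"
```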

Answer 3

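The following regular expression matches all the text between the last pair of | characters, where the closing | is followed by + or -.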

([^\|]*)(?=\|(\+|-))
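The pattern above is given without R code. One way to apply it in R, as a sketch: the lookahead (?=...) needs perl = TRUE, the backslashes are doubled for the R string, and the character class is written as [^|] because | needs no escaping inside brackets.

```r
x <- c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+")

# Lookahead is only supported by the PCRE engine (perl = TRUE)
pattern <- "([^|]*)(?=\\|(\\+|-))"

# regexpr finds the first match per string; regmatches extracts it
res <- regmatches(x, regexpr(pattern, x, perl = TRUE))
res
#[1] "Sema3d" "XYZ"    "ghd"
```

Note that regmatches drops elements with no match, so the result can be shorter than x when some strings do not follow the expected layout.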


Answer 4

We can use sub from base R:

sub(".*\\|(\\w+)\\|[-+]$", "\\1", x)
#[1] "Sema3d" "XYZ"    "ghd"

Or use gsub:

gsub(".*\\d+\\||\\|.*", "", x)
#[1] "Sema3d" "XYZ"    "ghd"

Data

x <- c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+")

Answer 5

The unglue package offers a readable alternative, even if it is not the most efficient:

x = c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+")
unglue::unglue_vec(x, "{drop1}|0|{keep}|{drop2}", var = "keep")
#> [1] "Sema3d" "XYZ"    "ghd"
# or
unglue::unglue_vec(x, "{=.*?}|0|{keep}|{=.*?}")
#> [1] "Sema3d" "XYZ"    "ghd"

Or directly in a data frame:

df <- data.frame(col = x)
unglue::unglue_unnest(df, col, "{=.*?}|0|{new_col}|{=.*?}")
#>   new_col
#> 1  Sema3d
#> 2     XYZ
#> 3     ghd
