Question

      Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456

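I have a column with 75 rows of values like Col above. I am not sure how to use gsub or sub to keep everything up to and including the integer after the first colon.

Expected output: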

      Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456

I tried this, but it does not seem to work:

gsub("*..:","", df$col)

Answer 1

The following may also help you.

sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)

The output is as follows.

> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03"   "WBU-ARDU*08:01"   "WBU-ARFU*11:03"   "WBU-ARFU*03:456b"

The input data frame is built as follows.

dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)

Explanation: the breakdown below is for illustration only; it is not runnable as written.

sub("        ## sub() replaces only the first match in each element (gsub() would replace every match).
([^:]*)     ## Capture group 1: zero or more non-colon characters, i.e. everything before the first colon.
:           ## A literal colon.
([^:]*)     ## Capture group 2: zero or more non-colon characters, i.e. everything up to the next colon or the end of the string.
.*"         ## Everything that remains, from the second colon onward.
,"\\1:\\2"  ## The replacement: the value of group 1, a colon, then the value of group 2.
,df$dat)    ## The dat column of the data frame df.
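
The breakdown above can be turned into something runnable. This is a minimal sketch, assuming the same sample data as in this answer; it assembles the same pattern piece by piece with paste0 so that each part can carry a comment. The names pattern and result are illustrative and not part of the original answer.

```r
# Sample data from this answer
dat <- c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
         "WBU-ARFU*11:03:05", "WBU-ARFU*03:456b")

# Build the regex from its parts (illustrative variable name)
pattern <- paste0(
  "([^:]*)",  # group 1: everything before the first colon
  ":",        # the first colon itself
  "([^:]*)",  # group 2: everything up to the next colon or end of string
  ".*"        # the rest of the string, which is discarded
)

# Replace the whole match with "group 1:group 2"
result <- sub(pattern, "\\1:\\2", dat)
print(result)
```

Because the whole string is matched and replaced, a value with only one colon (such as the last element) comes back unchanged.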

Answer 2

You can use

df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)

See the regex demo.

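Details

(\\d:\\d+) - capturing group 1 (its value is available as \\1 in the replacement pattern): a digit, a colon, and one or more digits.
: - a colon
\\d+ - one or more digits
$ - the end of the string.

Because the pattern requires a second colon followed by digits at the end of the string, a value such as WBU-ARFU*03:456 does not match and is returned unchanged.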

R Demo：

col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03"  "WBU-ARDU*08:01"  "WBU-ARFU*11:03"  "WBU-ARFU*03:456"

Alternative:

df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)

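See the regex demo.

Here,

^ - the start of the string
(.*?:\\d+) - group 1: any zero or more characters, as few as possible (because of the lazy *? quantifier), then a : and one or more digits
.* - the rest of the string.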

However, this pattern should be used with the PCRE regex engine, which is enabled by passing perl=TRUE:

col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03"  "WBU-ARDU*08:01"  "WBU-ARFU*11:03"  "WBU-ARFU*03:456"

See the R online demo.
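
To see why the lazy quantifier matters in the alternative pattern, here is a small sketch comparing it with a greedy version of the same regex. The greedy variant and the names lazy and greedy are my own illustration, not part of the answer above.

```r
col <- c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
         "WBU-ARFU*11:03:05", "WBU-ARFU*03:456")

# Lazy .*? stops at the first colon, so group 1 ends after the first
# colon-delimited number
lazy <- sub("^(.*?:\\d+).*", "\\1", col, perl = TRUE)

# Greedy .* runs to the last colon, so group 1 captures the whole string
# and nothing is removed
greedy <- sub("^(.*:\\d+).*", "\\1", col, perl = TRUE)

print(lazy)
print(greedy)
```

With the greedy quantifier the strings come back unchanged, which is why the pattern relies on *? to cut at the first colon.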

Answer 3

sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03"  "WBU-ARDU*08:01"  "WBU-ARFU*11:03"  "WBU-ARFU*03:456"

Or use stringi to match what you want (rather than removing what you do not need):

stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")

More concisely, with stringr:

stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")
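
If neither package is available, the same "match what you want" idea can be sketched in base R with regexpr and regmatches. This is an assumption-laden equivalent, not part of the answer: the character class is rewritten as [A-Z*-] with the hyphen placed last so it is read literally, and the name extracted is illustrative.

```r
Col <- c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
         "WBU-ARFU*11:03:05", "WBU-ARFU*03:456")

# regexpr() locates the first match in each element;
# regmatches() then pulls out the matched substrings
m <- regexpr("[A-Z*-]+[0-9]+:[0-9]+", Col)
extracted <- regmatches(Col, m)
print(extracted)
```

Note that regmatches drops elements with no match, so the result can be shorter than the input, whereas the stringr and stringi functions return NA in those positions.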

Use the gsub or sub function to get only part of a string?
