Question

I have a data frame with a time column and a string column. I want to subset this data frame, keeping only the rows where the string column contains a $ symbol.

After subsetting, I want to clean the string column so that it contains only the characters that come after the $ symbol, up to the next space or symbol.

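Here is the data:

df <- data.frame("time"=c(1:10),
"string"=c("$ABCD test","test","test $EFG test",
"$500 test","$HI/ hello","test $JK/",
"testing/123","$MOO","$abc","123"))

I want the final output to be:

Time string
1    ABCD
3    EFG
4    500
5    HI
6    JK
8    MOO
9    abc

This keeps only the rows whose string column contains $, and then keeps only the characters after the $ symbol, up to a space or symbol.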

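I have had some success using sub to extract just the symbol, but I have not been able to apply that to the data frame and subset it. Thanks for your help.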

Answer 1

Until someone comes up with a pretty regex solution, here is my take:

# subset for $ signs and convert to character class
res <- df[ grepl("$", df$string, fixed = TRUE),]
res$string <- as.character(res$string)

# split on non alpha and non $, and grab the one with $, then remove $
res$clean <- sapply(strsplit(res$string, split = "[^a-zA-Z0-9$']", perl = TRUE),
                    function(i){
                      x <- i[grepl("$", i, fixed = TRUE)]
                      # in case when there is more than one $
                      # x <- i[grepl("$", i, fixed = TRUE)][1]
                      gsub("$", "", x, fixed = TRUE)
                    })
res
#   time         string clean
# 1    1     $ABCD test  ABCD
# 3    3 test $EFG test   EFG
# 4    4      $500 test   500
# 5    5     $HI/ hello    HI
# 6    6      test $JK/    JK
# 8    8           $MOO   MOO
# 9    9           $abc   abc
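
To make the split-and-filter step easier to follow, here is a minimal sketch that runs the same three calls on a single value from the question's data. The variable names s, pieces, hit and cleaned are only illustrative and are not part of the answer above.

```r
# One sample value taken from the question's data
s <- "test $JK/"

# Split on every character that is not a letter, a digit, "$" or "'".
# Here the space and the trailing "/" act as separators.
pieces <- strsplit(s, split = "[^a-zA-Z0-9$']", perl = TRUE)[[1]]
print(pieces)

# Keep only the piece that contains a literal "$" ...
hit <- pieces[grepl("$", pieces, fixed = TRUE)]

# ... and strip the "$" from it
cleaned <- gsub("$", "", hit, fixed = TRUE)
print(cleaned)
```

If a value can contain more than one $, hit has more than one element; the commented-out line in the answer (taking the first element with [1]) covers that case.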

Answer 2

We can do this with regexpr/regmatches, extracting only the substring that follows the $:

i1 <- grep("$", df$string, fixed = TRUE)
transform(df[i1,], string = regmatches(string, regexpr("(?<=[$])\\w+", string, perl = TRUE)))
#    time string
#1    1   ABCD
#3    3    EFG
#4    4    500
#5    5     HI
#6    6     JK
#8    8    MOO
#9    9    abc
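
The pattern relies on a lookbehind: (?<=[$]) requires a $ immediately before the match without including it, and \\w+ then takes the following letters, digits and underscores. The sketch below, on a small hypothetical vector x (not the question's data frame), also shows why the rows are filtered with grep first: regmatches returns only the elements that matched, so an unfiltered column could yield a shorter vector than the data frame has rows.

```r
# Hypothetical input: three values with "$" and one without
x <- c("$ABCD test", "test $EFG test", "$HI/ hello", "no dollar here")

# Lookbehind needs perl = TRUE; regexpr returns -1 where nothing matched
m <- regexpr("(?<=[$])\\w+", x, perl = TRUE)
matched <- as.integer(m) != -1L
print(matched)

# regmatches silently drops the non-matching element,
# so the result has length 3, not 4
extracted <- regmatches(x, m)
print(extracted)
```

The same length mismatch would occur for a value such as "$ test", which contains $ but has no word character right after it, so that case would need separate handling.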

Or using tidyverse syntax:

library(tidyverse)
df %>% 
   filter(str_detect(string, fixed("$")))  %>%
   mutate(string = str_extract(string, "(?<=[$])\\w+"))
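
Since the question mentions sub, here is a base R sketch of the same filter-then-extract idea using sub with a capture group. It is an alternative under the assumption that each kept row has at least one word character right after its first $; it is not taken from either answer, and the names keep and res are illustrative.

```r
# The question's data, with strings kept as character
df <- data.frame(
  time = 1:10,
  string = c("$ABCD test", "test", "test $EFG test",
             "$500 test", "$HI/ hello", "test $JK/",
             "testing/123", "$MOO", "$abc", "123"),
  stringsAsFactors = FALSE
)

# Step 1: keep only rows whose string contains a literal "$"
keep <- grepl("$", df$string, fixed = TRUE)
res <- df[keep, ]

# Step 2: replace the whole value with the word characters
# captured right after the first "$"
res$string <- sub("^[^$]*[$](\\w+).*$", "\\1", res$string, perl = TRUE)
print(res)
```

Unlike regmatches, sub returns a vector of the same length as its input; a value that does not fit the pattern is returned unchanged rather than dropped.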

If the string column
