Extracting the first and last strings separated by '>' from lines of text in a data frame

Time: 2018-06-08 05:42:46

Tags: r regex string dplyr substring

I have a data frame that looks like this:

 DF$Lst 
 [1] "Some text > in > a string"
     "Another > text in > another > set of string"  
     "This is only > One text"
     "NA"
     ..... so forth

As you can see, each row contains text separated by '>'.

I want to create TWO new columns that contain only the first string and the last string, for example:

  Text                                         Col1         Col2
  Some text > in > a string                    Some text    a string
  Another > text in > another > set of string  Another      set of string

I am trying to use this function:

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}

substrRight(x, 6)

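But I don't think this is the right approach, because the function above takes a fixed number of characters and doesn't help here. Is there a better way to solve this?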

2 Answers:

Answer 0: (score: 2)

We can use extract from tidyr:

library(tidyverse)
DF %>% 
 extract(Text, into = c('Col1', 'Col2'), '^([^>]+) >.* > ([^>]+)$', 
       remove = FALSE)
#                                       Text      Col1          Col2
#1                   Some text > in > a string Some text      a string
#2 Another > text in > another > set of string   Another set of string
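
Note that this pattern needs at least two ' > ' separators, so a row with a single separator would not match, and extract would then give NA in both new columns. A quick base R check of which strings match, as a sketch (the second test string is borrowed from the updated data further down; the names pattern and ok are only for illustration):

```r
# Same pattern as in the extract() call above
pattern <- "^([^>]+) >.* > ([^>]+)$"

# One string with two separators, one with a single separator
ok <- grepl(pattern, c("Some text > in > a string", "This > is one"))
ok
```

Only the first string matches, which is why a more permissive pattern is needed once such rows appear.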

Or, in base R, split on ' > ' and then take the first and last elements:

DF[c('Col1', 'Col2')] <- t(sapply(strsplit(DF$Text, " > "),
             function(x) c(x[1], x[length(x)])))
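
The same idea can be wrapped in a small helper. This is only a sketch: first_last is a hypothetical name, and it assumes that splitting on a bare '>' and trimming whitespace is acceptable for the data at hand.

```r
# Hypothetical helper (not from the original answer): returns the first and
# last '>'-separated pieces of each string as a two-column character matrix.
first_last <- function(x) {
  parts <- strsplit(x, ">", fixed = TRUE)
  t(vapply(parts, function(p) {
    # Empty input splits to character(0); return NAs so the shape stays fixed
    if (length(p) == 0) return(c(NA_character_, NA_character_))
    p <- trimws(p)
    c(p[1], p[length(p)])
  }, character(2)))
}

DF <- data.frame(Text = c("Some text > in > a string",
                          "Another > text in > another > set of string"),
                 stringsAsFactors = FALSE)
DF[c("Col1", "Col2")] <- first_last(DF$Text)
DF
```

A string without any '>' yields the same value in both columns, and a real NA stays NA in both.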

Update

In the updated dataset 'DF3', the NAs are strings. We can convert them to real NAs:

is.na(DF3$Text) <- DF3$Text == "NA"
DF3[c('Col1', 'Col2')] <- t(sapply(strsplit(DF3$Text, " > "),
       function(x) c(x[1], x[length(x)])))
DF3
#                                         Text      Col1          Col2
#1                   Some text > in > a string Some text      a string
#2 Another > text in > another > set of string   Another set of string
#3                               This > is one      This        is one
#4                                        <NA>      <NA>          <NA>

Or with a pattern similar to @Onyambu's:

 DF3 %>%
   extract(Text, into = c("Col1", "Col2"), 
               "^([^>]*)>(?:.*>)?([^>]*)$", remove = FALSE)
 #                                       Text       Col1           Col2
 #1                   Some text > in > a string Some text        a string
 #2 Another > text in > another > set of string   Another   set of string
 #3                               This > is one      This          is one
 #4                                        <NA>       <NA>           <NA>

Data

DF <- structure(list(Text = c("Some text > in > a string", 
 "Another > text in > another > set of string"
)), .Names = "Text", row.names = c(NA, -2L), class = "data.frame")   



DF3 <- structure(list(Text = c("Some text > in > a string",
"Another > text in > another > set of string", "This > is one", "NA")), 
 .Names = "Text", row.names = c(NA, -4L), class = "data.frame")

Answer 1: (score: 2)

Base R version:

text=DF$Lst# Will assume this is given
read.table(text=sub(">.*>",">",text),sep=">")
          V1             V2
1 Some text        a string
2   Another   set of string


cbind(text,read.table(text=sub(">.*>",">",text),sep=">"))
                                         text         V1             V2
1                   Some text > in > a string Some text        a string
2 Another > text in > another > set of string   Another   set of string

Another base R approach:

data.frame(do.call(rbind,regmatches(text,regexec("(.*)>.*>(.*)",text))))
                                           X1                 X2             X3
1                   Some text > in > a string         Some text        a string
2 Another > text in > another > set of string Another > text in   set of string
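
In the output above, X2 for the second row is "Another > text in" rather than "Another", because the first (.*) is greedy and runs up to the second-to-last '>'. One possible way around this, sketched below, is to use [^>]* for the captured groups so they cannot cross a separator. The names m, mat and res are only for illustration, and trimming the whitespace is an added assumption.

```r
text <- c("Some text > in > a string",
          "Another > text in > another > set of string")

# [^>]* cannot cross a '>', so group 1 stops at the first separator and
# group 2 starts after the last one; (?:.*>)? skips any middle pieces.
m <- regmatches(text, regexec("^([^>]*)>(?:.*>)?([^>]*)$", text, perl = TRUE))

# Each element of m is c(full match, group 1, group 2)
mat <- do.call(rbind, m)
res <- data.frame(first = trimws(mat[, 2]),
                  last  = trimws(mat[, 3]),
                  stringsAsFactors = FALSE)
res
```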

Edit:

read.table(text=sub("(^.*?)>(?:.*>)*(.*$)","\\1>\\2",text),sep=">",fill = T,na.strings = "")
             V1             V2
1    Some text        a string
2      Another   set of string
3 This is only        One text
4            NA           <NA>

Or you could do this:

read.table(text=sub("(^[^>]*).*?([^>]*$)","\\1>\\2",text),sep=">",fill = T,na.strings = "")
             V1             V2
1    Some text        a string
2      Another   set of string
3 This is only        One text
4          <NA>             NA

Using separate:

 separate(data.frame(text),text,c("col1","col2"),"((?:>.*)>|>)",fill="right" )
           col1           col2
1    Some text        a string
2      Another   set of string
3 This is only        One text
4            NA           <NA>