R: extracting a pattern that occurs a different number of times per element

Asked: 2017-04-12 08:19:18

Tags: r string extract

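I have run into the following problem: I have a text that is split into chapters and stored in a character vector. Suppose it looks like this: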

text <- c("Here are information about topic1.", 
"Here are some information about topic2 or topic3.", 
"Chapter number 4 is really annoying.", 
"Topic4 is discussed in this chapter.")

I want to extract the different topics mentioned in each chapter, so my output should look like this:

output
      [1]       [2]
[1] "topic1"
[2] "topic2" "topic3"
[3]
[4] "Topic4"

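So some rows have multiple matches and some have none.

I tried using str_extract_all and unlisting the resulting list of lists, but the rows end up with different numbers of elements, which causes problems.

Thanks everyone!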

1 answer:

Answer 0 (score: 4)

You can use rbind.fill.matrix from plyr:

text <- c("Here are information about topic1.", 
          "Here are some information about topic2 or topic3.", 
          "Chapter number 4 is really annoying.", 
          "Topic4 is discussed in this chapter.")

library(stringr)
library(plyr)

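# str_extract_all() returns a list: one character vector of matches per chapter
xy <- str_extract_all(text, pattern = "[Tt]opic\\d+")
# One-row matrix per chapter; lapply() always returns a list, whereas sapply()
# may simplify it when every chapter has the same number of matches
xy <- lapply(xy, FUN = function(x) matrix(x, nrow = 1))
rbind.fill.matrix(xy) # from plyr; stacks the rows and fills missing cells with NA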

     1        2       
[1,] "topic1" NA      
[2,] "topic2" "topic3"
[3,] NA       NA      
[4,] "Topic4" NA
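
If you would rather not depend on plyr, the same idea can probably be sketched in base R: extract the matches, pad every result with NA up to the longest one, and then rbind the rows. This is only a minimal sketch under the assumption that NA padding is what you want; the names matches, n_col, padded and result are illustrative and not part of any package.

```r
text <- c("Here are information about topic1.",
          "Here are some information about topic2 or topic3.",
          "Chapter number 4 is really annoying.",
          "Topic4 is discussed in this chapter.")

# One character vector of matches per chapter (character(0) if there is none)
matches <- regmatches(text, gregexpr("[Tt]opic[0-9]+", text))

# The chapter with the most matches sets the number of columns
# (at least 1, so the matrix never ends up with zero columns)
n_col <- max(lengths(matches), 1)

# Pad every vector with NA up to n_col, then stack the vectors as rows
padded <- lapply(matches, function(x) { length(x) <- n_col; x })
result <- do.call(rbind, padded)
result
```

Assigning to length(x) extends a vector with NA, which is what makes every row the same width before rbind is called.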