Extracting values from strings with a repeating structure

Time: 2013-02-26 01:08:44

Tags: string r

Suppose I have a vector of strings like this:

vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

There is a clear repeating "Name: ", ", College: " pattern in it.

I would like to produce a list (or data.frame) that looks like this:

listOfValues <- list(c("Andrew", "Charlie Daniels", "Frank Gehry, III"),
                     c("Bradford", "Easton College", "Highlands University"))

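What is the simplest way to get from vectorOfStrings to listOfValues? I am fairly familiar with the base string manipulation functions and with stringr, but I assume this is a relatively common situation, so I am hoping there is a relatively well-established solution.

Thanks in advance.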

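5 answers:

Answer 0 (score: 4)

Here are two possible solutions:

(1) strapplyc. The mat statement creates a matrix whose first column contains the names and whose second column contains the colleges. The last statement then converts it to an unnamed list: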

library(gsubfn)

pat <- "Name: (.*), College: (.*)"
mat <- strapplyc(vectorOfStrings, pat, simplify = rbind)

unname(as.list(as.data.frame(mat, stringsAsFactors = FALSE)))

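(2) gsub / read.table. A variation that uses only plain R is to use gsub with pat from above to convert each input string into a pipe-delimited string that contains the data but not the labels. Reading that with read.table gives a data frame, DF. Finally, we convert DF to an unnamed list: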

g <- gsub(pat, "\\1|\\2", vectorOfStrings)
DF <- read.table(text = g, sep = "|", as.is = TRUE)

unname(as.list(DF))

Added: the second solution.
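One caveat with the read.table route, offered as a hedged sketch rather than as part of the original answer: by default read.table treats ' and " as quote characters and # as a comment character, so a value such as O'Brien may be misparsed. The variant below switches both off. The fourth input string and the column names Name and College are made up for illustration, and the approach still assumes that no value contains a | character.

```r
vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University",
                     "Name: Josh O'Brien, College: Easton College")  # hypothetical extra row

pat <- "Name: (.*), College: (.*)"

# Replace each string with "name|college", dropping the labels
g <- gsub(pat, "\\1|\\2", vectorOfStrings)

# quote = "" and comment.char = "" stop apostrophes and # from being interpreted
DF <- read.table(text = g, sep = "|", as.is = TRUE,
                 quote = "", comment.char = "",
                 col.names = c("Name", "College"))  # column names are an assumption

unname(as.list(DF))
```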

Answer 1 (score: 3)

I like mathematical.coffee's idea, but since I had already written this, here is another possibility:

X <- strsplit(vectorOfStrings, ",\\s*(?=College:)", perl=TRUE)
do.call(rbind, lapply(X, function(X) gsub("(Name|College):\\s*", "", X)))
#      [,1]               [,2]                  
# [1,] "Andrew"           "Bradford"            
# [2,] "Charlie Daniels"  "Easton College"      
# [3,] "Frank Gehry, III" "Highlands University"

Answer 2 (score: 2)

  do.call(rbind, strsplit(unlist(
            strsplit(vectorOfStrings, "Name: ")), ", College: "))

       [,1]               [,2]                  
  [1,] "Andrew"           "Bradford"            
  [2,] "Charlie Daniels"  "Easton College"      
  [3,] "Frank Gehry, III" "Highlands University"


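There seem to be plenty of good answers already. Like @Josh O'Brien, I would use strsplit.

Since you are not keeping "Name" or "College", you can split on them directly. Then you just wrap the result in do.call(rbind, ___), which automatically drops the empty pieces created by the split.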
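To see where the empty pieces come from and why they disappear, it can help to run the two splits on a single string. This is only a minimal sketch; the names step1 and step2 are introduced here for illustration.

```r
x <- "Name: Andrew, College: Bradford"

# Splitting on the leading label leaves an empty first element
step1 <- strsplit(x, "Name: ")[[1]]
step1

# The empty string splits into a zero-length vector, the other into two pieces
step2 <- strsplit(step1, ", College: ")
lengths(step2)

# rbind ignores zero-length vectors, so only the real row survives
do.call(rbind, step2)
```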

Answer 3 (score: 1)

I use regexpr with perl=TRUE for this sort of thing (otherwise I can't find a good way to extract the capture groups):

m <- regexpr('^Name: *(.+), *College: (.+) *$',
             vectorOfStrings, perl=TRUE)
# m looks like this:
# [1] 1 1 1
# attr(,"match.length")
# [1] 31 46 53
# attr(,"useBytes")
# [1] TRUE
# attr(,"capture.start")  # one column per capturing bracket,   
# [1,] 7 24               # one row per entry in vectorOfStrings
# [2,] 7 33
# [3,] 7 34
# attr(,"capture.length")    
# [1,]  6  8
# [2,] 15 14
# [3,] 16 20
# attr(,"capture.names")
# [1] "" ""

# laziness
st <- attr(m, 'capture.start')
en <- st + attr(m, 'capture.length') - 1
numCaptures <- ncol(st)

matches <- sapply(1:numCaptures, function (i) {
    return(substr(vectorOfStrings, st[, i], en[, i]))
})

# matches
#     [,1]               [,2]                  
# [1,] "Andrew"           "Bradford"            
# [2,] "Charlie Daniels"  "Easton College"      
# [3,] "Frank Gehry, III" "Highlands University"

Now massage matches into whatever form you want. I usually wrap this in a function because I use it so often.

You can even use Perl named capture groups, like this:

m <- regexpr('^Name: *(?<name>.+), *College: (?<college>.+) *$',
             vectorOfStrings, perl=TRUE)

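Then attr(m, 'capture.names') is c('name', 'college'), and colnames(attr(m, 'capture.(start or length)')) will also be c('name', 'college').

In any case, the key seems to be using perl=TRUE; otherwise regexpr does not return a set of start/end points for each capturing bracket.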
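A wrapper function of the kind mentioned above might look roughly like the following. This is a sketch, not the answer author's actual function: the name extract_captures is hypothetical, and it assumes the pattern contains at least one capture group.

```r
# Hypothetical helper: returns a character matrix with one column per
# capture group and one row per element of x (NA where there is no match).
extract_captures <- function(pattern, x) {
  m   <- regexpr(pattern, x, perl = TRUE)
  st  <- attr(m, "capture.start")
  len <- attr(m, "capture.length")
  # st and len are n-by-k matrices; repeat x once per capture column
  out <- matrix(substring(rep(x, times = ncol(st)), st, st + len - 1),
                nrow = length(x),
                dimnames = list(NULL, attr(m, "capture.names")))
  out[which(m == -1), ] <- NA_character_
  out
}

vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

res <- extract_captures('^Name: *(?<name>.+), *College: (?<college>.+) *$',
                        vectorOfStrings)
res[, "name"]
res[, "college"]
```

Because the columns carry the capture names, as.data.frame(res) would give a data.frame with name and college columns.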
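For what it's worth, base R also has regexec, which together with regmatches returns the capture groups directly and does not need perl=TRUE. The sketch below is an alternative to the approach in this answer, not the answer author's method; the names pat, parts and mat are illustrative.

```r
vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

pat <- "^Name: *(.+), *College: *(.+)$"

# Each list element holds the full match followed by the capture groups
parts <- regmatches(vectorOfStrings, regexec(pat, vectorOfStrings))

# Bind into a matrix and drop the first column (the full match)
mat <- do.call(rbind, parts)[, -1, drop = FALSE]
mat
```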

Answer 4 (score: 1)

Using backreferences is probably simpler:

> vectorOfStrings
[1] "Name: Andrew, College: Bradford"                       "Name: Charlie Daniels, College: Easton College"       
[3] "Name: Frank Gehry, III, College: Highlands University"
> list(gsub('^Name: (.*), College: (.*)$',"\\1", vectorOfStrings) , gsub('^Name: (.*), College: (.*)$',"\\2", vectorOfStrings))
[[1]]
[1] "Andrew"           "Charlie Daniels"  "Frank Gehry, III"

[[2]]
[1] "Bradford"             "Easton College"       "Highlands University"
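The same backreference idea can be written without repeating the pattern, by looping over the two replacement strings. This is a small sketch under the assumption that every input matches the pattern; the space after each label is part of the pattern so that the values come back without leading blanks.

```r
vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

pat <- "^Name: (.*), College: (.*)$"

# One sub() call per backreference: "\\1" pulls the name, "\\2" the college
listOfValues <- lapply(c("\\1", "\\2"),
                       function(ref) sub(pat, ref, vectorOfStrings))
listOfValues
```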