Question

Is there a more "R" way to pull two meaningful characters out of a longer string in a data.table column?

I have a data.table with a column of "degree strings": shorthand codes for the degree someone earned and the year they graduated.

> srcDT<- data.table(
    alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
    degree=c("W72","WG95","W88")
    )

> srcDT
               alum degree
1:      Paul Lennon    W72
2:  Stevadora Nicks   WG95
3:     Fred Murcury    W88

I need to pull the year digits out of degree and put them in a new column called "degree_year".

No problem:

> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]

> srcDT
                alum degree degree_year
 1:      Paul Lennon    W72          72
 2:  Stevadora Nicks   WG95          95
 3:     Fred Murcury    W88          88

If only it were that simple. The problem is that the degree strings only sometimes look like the ones above. More often, they look like this:

srcDT<- data.table(
  alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
  degree=c("W72 C73","WG95 L95","W88 WG90")
)

I'm only interested in the 2 digits next to the codes I care about: W & WG (and if both W and WG are present, I only care about WG).

Here's how I solved it:

x <-srcDT$degree                     ##grab just the degree column
z <-character()                       ## create an empty character vector
degree.grep.pattern <-c("WG[0-9][0-9]","W[0-9][0-9]")
                                     ## define a vector of regex's, in the order
                                     ## I want them

for(i in 1:length(x)){               ## loop thru all elements in degree column
  matched=F                          ## at the start of the loop, reset flag to F
  for(j in 1:length(degree.grep.pattern)){
                                     ## loop thru all elements of the pattern vector

    if(length(grep(degree.grep.pattern[j],x[i]))>0){
                                     ## see if you get a match

      m <- regexpr(degree.grep.pattern[j],x[i])
                                     ## if you do, great! grab the index of the match
      y<-regmatches(x[i],m)          ## then subset down.  y will equal "WG95"
      matched=T                      ## set the flag to T
      break                          ## stop looping
    }
                                     ## if no match, go on to next element in pattern vector
  }

  if(matched){                       ## after finishing the loop, check if you got a match
    yr <- substr(y,nchar(y)-1,nchar(y))
                                     ## if yes, then grab the last 2 characters of it
  }else{
    ## if you run thru the whole list and don't match any pattern at all, just
    ## take the last two characters from the degree string
    yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
  }
  z<-c(z,yr)                         ## add this result (95) to the character vector
}
srcDT$degree_year<-z                ## set the column to the results.

> srcDT
             alum   degree degree_year
1: Ringo Harrison  W72 C73          72
2:   Brian Wilson WG95 L95          95
3:   Mike Jackson W88 WG90          90

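This works. 100% of the time. No errors, no mismatches. The problem is that it doesn't scale. Given a data.table with 10k or 100k rows, it really slows down.

Is there a smarter, better way to do this? This solution feels very "C" to me. Not very "R".

Thoughts on improvements?

Note: I've given a simplified example. In the real data there are about 30 different degrees, and combined with the different years there are 540 unique degree-string combinations. Also, I've shown degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.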

Answer 1

It looks like (per the OP's comment) there is no "WG W" case, so a simple regex solution should do the job:

srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
#              alum   degree degree_year
# 1: Ringo Harrison  W72 C73          72
# 2:   Brian Wilson WG95 L95          95
# 3:   Mike Jackson W88 WG90          90
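
If the real data has 7 or 8 prefixes with a priority order, one way to keep things vectorised is to loop over the patterns instead of the rows. The sketch below is only an illustration under that assumption: `extract_degree_year` is a hypothetical helper name, and the fallback (last two characters when nothing matches) is copied from the loop in the question.

```r
# Hypothetical helper (not from the original post): try each pattern in
# priority order, working on all rows at once.
extract_degree_year <- function(x, patterns = c("WG[0-9]{2}", "W[0-9]{2}")) {
  x <- as.character(x)
  # Fallback from the question: last two characters of the whole string
  out <- substr(x, nchar(x) - 1, nchar(x))
  todo <- rep(TRUE, length(x))            # rows no pattern has claimed yet
  for (p in patterns) {
    m <- regexpr(p, x)                    # first match position, -1 if none
    hit <- todo & !is.na(m) & m > 0       # rows matched now and not earlier
    tok <- regmatches(x[hit], regexpr(p, x[hit]))
    out[hit] <- substr(tok, nchar(tok) - 1, nchar(tok))
    todo[hit] <- FALSE
  }
  out
}

res <- extract_degree_year(c("W72 C73", "WG95 L95", "W88 WG90"))
```

With a data.table this could be called as `srcDT[, degree_year := extract_degree_year(degree)]`. The loop runs once per pattern rather than once per row, so the cost grows with the number of patterns, not with the row count times the pattern count.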

Answer 2

Here's a solution based on the assumption that you want the most recent degree containing a W:

regex <- "(?<=W|(?<=W)G)[0-9]{2}"

srcDT[ , degree_year := 
         sapply(regmatches(degree, 
                           gregexpr(regex, degree, perl = TRUE)),
                function(x) max(as.integer(x)))]

> srcDT
             alum   degree degree_year
1: Ringo Harrison  W72 C73          72
2:   Brian Wilson WG95 L95          95
3:   Mike Jackson W88 WG90          90
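
To see what the lookbehind is doing, it may help to look at the intermediate matches in plain base R. This is just a sketch on the example strings; `matches` and `years` are names made up for the illustration.

```r
regex  <- "(?<=W|(?<=W)G)[0-9]{2}"
degree <- c("W72 C73", "WG95 L95", "W88 WG90")

# One character vector per row, holding every two-digit run that directly
# follows "W" or "WG"; digits after C or L are skipped.
matches <- regmatches(degree, gregexpr(regex, degree, perl = TRUE))

# Keep the largest year per row. This assumes every row has at least one
# match; max() on an empty vector would not return an integer.
years <- vapply(matches, function(x) max(as.integer(x)), integer(1))
```

Two things to keep in mind: the result is an integer column rather than a character one, and with two-digit years `max()` would pick 99 over 01, so this only gives "most recent" if the data stays within one century.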

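You said:

I've shown degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.

But I'm not sure what that means. Are there more options besides W and WG?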

Answer 3

Here's a quick hack:

# split all words from degree and order so that WG is before W
words <- lapply(strsplit(srcDT$degree, " "), sort, decreasing=TRUE)

# obtain tags for each row (getting only first. But works since ordered)
tags <- mapply(Find, list(function(x) grepl("^WG|^W", x)), words)

# simple gsub to remove WG and W
(result <- gsub("^WG|^W", "", tags))
[1] "72" "95" "90"

It runs quickly on 100k rows.
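
A variation on the same sort-then-pick idea, in case some rows have no W degree at all. This is only a sketch with a few assumptions of mine: `method = "radix"` is used so the ordering does not depend on the locale, the pattern is tightened to `^WG?[0-9]` so that other codes starting with W are not picked up, and unmatched rows get `NA` (the hack above would return a list in that case).

```r
degree <- c("W72 C73", "WG95 L95", "W88 WG90", "C73")

# Sort each row's tokens in decreasing order so "WG.." comes before "W..".
# Radix sorting compares in the C locale, where digits sort before letters.
words <- lapply(strsplit(degree, " "), sort,
                decreasing = TRUE, method = "radix")

# First token per row that starts with W or WG followed by a digit
first_w <- vapply(words, function(w) {
  hit <- Find(function(s) grepl("^WG?[0-9]", s), w)
  if (is.null(hit)) NA_character_ else hit
}, character(1))

# Strip the prefix, leaving the year digits (NA stays NA)
result <- sub("^WG?", "", first_w)
```

Using `vapply` with `character(1)` means the result is always a character vector of the same length as the input, which is what `:=` needs.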

Answer 4

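A solution without regex. It's slow because it builds a sparse table... but it's clean and flexible, so I'm putting it here.

First I split the degree-year strings on spaces, then loop through them and build a clean structured table with one column per degree, filled with the years.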

degreeyear_split <- sapply(srcDT$degree,strsplit," ") 
for(i in 1:nrow(srcDT)){
  for (degree_year in degreeyear_split[[i]]){
    n <- nchar(degree_year)
    degree <- substr(degree_year,1,n-2)
    year <- substr(degree_year,n-1,n)
    srcDT[i,degree] <- year  
  }}

Now I have my structured table. I take the year from W, then overwrite it with WG wherever WG is present.

srcDT$year <- srcDT$W
srcDT$year[srcDT$WG!=""]<-srcDT$WG[srcDT$WG!=""]

And here's your result:

srcDT
             alum   degree  W  C WG  L year
1: Ringo Harrison  W72 C73 72 73         72
2:   Brian Wilson WG95 L95       95 95   95
3:   Mike Jackson W88 WG90 88    90      90
