Looping grepl() through a data.table (R)

Time: 2015-11-13 18:04:28

Tags: regex r data.table data-cleaning

I have a data set stored as a data.table DT that looks like this:

print(DT)
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck

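I'd like to reduce the table to only the rows where the industry matches the category. My general approach is to use grepl() to regex-match the string '^{{INDUSTRY}}[a-z ]+$' (written with the placeholder {{IND}} in the code below) against each row of DT$category, substituting the corresponding row of DT$industry for the placeholder in the regex string with infuse(). I had trouble finding a slick data.table solution that loops through the table correctly and does the within-row comparison, so I used a for loop to get the job done: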

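library(data.table)
library(infuser)   # provides infuse()

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
    ind <- DT$industry[i]
    categ <- DT$category[i]   # originally read from d.daily, which is not defined here
    # fill this row's industry into the template and test this row's category
    if (grepl(infuse(template, IND = ind), categ)) {
        DT[i, match := TRUE]
    }
}
DT <- DT[match == TRUE]
print(DT)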
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

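However, I'm sure this can be done in a better way. Any suggestions on how to achieve this result using the functionality of the data.table package? My understanding is that in this case the package's own methods are likely to be more efficient than a for loop.

3 Answers:

Answer 0 (score: 6)

You could use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.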

DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse 

Alternatively, stringr::str_detect() can be used. Both of its arguments are vectorized as well.

library(stringr)
DT[str_detect(category, fixed(industry))]

Or, as a base R option, run grepl() through mapply():

DT[mapply(grepl, industry, category, fixed = TRUE)]

Or you can use Vectorize(grepl) to get the same result.

DT[Vectorize(grepl)(industry, category, fixed = TRUE)]

All of these yield the same result.

Data:

DT <- structure(list(category = c("administration", "nurse practitioner", 
"trucking", "administration", "warehousing", "warehousing", "trucking", 
"nurse practitioner", "nurse practitioner"), industry = c("admin", 
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse", 
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA, 
-9L))
setDT(DT)
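
A note on matching semantics: the fixed-string options above keep a row when the industry appears anywhere in the category, whereas the template in the question only accepts it at the start of the string. For the sample data both approaches select the same rows. If the anchoring matters, one possible base R sketch builds a per-row pattern with paste0() in place of infuse(). It assumes the industry values contain no regex metacharacters, and the names patterns and keep are chosen here only for illustration.

```r
# Sample data from this answer, as plain character vectors
category <- c("administration", "nurse practitioner", "trucking",
              "administration", "warehousing", "warehousing",
              "trucking", "nurse practitioner", "nurse practitioner")
industry <- c("admin", "truck", "truck", "admin", "nurse",
              "admin", "truck", "nurse", "truck")

# One anchored pattern per row, mirroring the question's "^{{IND}}[a-z ]+$"
# (assumption: industry holds plain text, no regex metacharacters)
patterns <- paste0("^", industry, "[a-z ]+$")

# Pairwise application: pattern i is tested against category i only
keep <- mapply(grepl, patterns, category, USE.NAMES = FALSE)

which(keep)   # row numbers that satisfy the anchored pattern
```

With the data.table from above, DT[keep] would then subset the matching rows.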

Answer 1 (score: 6)

As long as the match is always based on the beginning of the category string, this works nicely:

DT[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
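
As a possible variation on the same prefix test, base R (version 3.3.0 or later) has startsWith(), which compares element by element. The sketch below is an illustration on plain vectors holding the sample data, not part of the original answer; keep_prefix and keep_substr are hypothetical names.

```r
# Sample data from the question, as plain character vectors
category <- c("administration", "nurse practitioner", "trucking",
              "administration", "warehousing", "warehousing",
              "trucking", "nurse practitioner", "nurse practitioner")
industry <- c("admin", "truck", "truck", "admin", "nurse",
              "admin", "truck", "nurse", "truck")

# Element-wise: does category[i] begin with industry[i]?
keep_prefix <- startsWith(category, industry)

# The substring() comparison from this answer, for reference
keep_substr <- substring(category, 1, nchar(industry)) == industry

identical(keep_prefix, keep_substr)   # both approaches flag the same rows
which(keep_prefix)
```

Inside a data.table this would read DT[startsWith(category, industry)].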

Answer 2 (score: 5)

data.table excels at grouped operations; I think this is useful, assuming you have many rows with the same industry:

DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]

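This uses the current idiom for subsetting by group, thanks to @eddi.

Comments. These might help further:

  • If you have many rows with the same industry-category combination, try by=.(industry,category).

  • Try something else in place of grep (like the options in Ken's and Richard's answers).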
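
To illustrate the first comment, here is a minimal sketch that groups by both columns and swaps grep for grepl(..., fixed = TRUE). It assumes the data.table package is installed; idx and res are names picked for this example. Note that the grouped lookup returns row numbers group by group rather than in table order, so sort() is applied to get the rows back in their original order.

```r
library(data.table)

DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Within each (industry, category) group both values are length one,
# so grepl() runs once per distinct combination; a TRUE result keeps
# every row number (.I) of that group, a FALSE result keeps none.
idx <- DT[, .I[grepl(industry, category, fixed = TRUE)],
          by = .(industry, category)]$V1

# Row numbers come back in group order; sort() restores table order
res <- DT[sort(idx)]
print(res)
```

Whether this is faster than the row-wise options depends on how many duplicate combinations the data contains.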