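Optimizing regular expressions - match, extract, remove

Time: 2013-12-18 03:20:26

Tags: regex r optimization

I use the following two calls to find country names in a string: match the name, put it into a new column of the data frame, and then remove the country name from the original string: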

library("stringr")

ListofCountries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv",header=T,sep=",")

CoffeeTable <- data.frame(Product=c("Kenya Ndumberi", "Kenya Ndumberi", "Finca Nombre de Dios", "Finca La Providencia", "Las Penidas", "Las Penidas", "Las Penidas", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Progresso", "Progresso", "Progresso", "Progresso", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "\nEl Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Geisha", "Geisha", "Geisha", "Pacamara", "Pacamara", "Pacamara", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Brazil yellow bourbon pea berry", "Finca El Vintilador", "\nWashed Yirgacheffe", "Finca El Vintilador", "Washed Yirgacheffe", "Washed Yirgacheffe", "Washed Yirgacheffe", "Leza", "Finca La Libertad", "Pacamara", "Pacamara", "Pacamara", "Finca La Bolsa", "Thunguri Kenya", "Thunguri Kenya", "Thunguri Kenya", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Pedregal", "Pedregal", "Barrel Aged", "Pedregal", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "Amigo de Buesaco", "Amigo de Buesaco", "Amigo de Buesaco", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "\nToarco Jaya Peaberry Sulawesi", "El Cypress", "El Cypress", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro"))

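# Build the alternation from the name column; pasting the whole data frame
# does not produce a "name1|name2|..." pattern
CoffeeTable$Country <- str_trim(str_match(tolower(CoffeeTable$Product), 
                                            tolower(paste(ListofCountries$name, collapse="|")))[,1])

# Remove the matched country name from the original string
CoffeeTable$Product <- str_trim(gsub(tolower(paste(ListofCountries$name, collapse="|")), replacement="", 
                          CoffeeTable$Product, ignore.case=T))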

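Question 1 - This is slow. How can I make these calls run faster?

Question 2 - This only catches official country names. Does anyone know of a good list of common country names? (e.g. 'China' vs. 'Republic of China')

Thanks!

Edit: I have included a list of 90 coffee names above to make this a reproducible example. I should add that in my actual application, CoffeeTable already exists and has about 2,000 rows and 45 columns. I am not looking for a faster way to build the data.frame, etc.

Thanks!

Edit 2: Question 2 has been answered. Now I am just trying to optimize the two calls so that they do not take 5 to 10 seconds to run!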

2 Answers:

Answer 0: (score: 0)

For your second question, there is an extensive list of options here. Try this:

countries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv",header=T,sep=",")

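Edited in response to the OP's comment.

Using the sample data you provided, replicated 25X to get roughly the same number of rows as in your real data, your code runs in about 1.6 seconds. It is hard to believe there is an 8X difference between your system and mine, so something else must be going on.

The only thing I can recommend is to look at strapplyc(...) in the gsubfn package. It is supposed to be very efficient, but on my system it is actually slower than your code.

See the code below for an example and benchmarks. Sorry I couldn't be of more help...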

library(stringr)
df <- CoffeeTable
df$Product <- as.character(df$Product)  # make sure this is character, not factor
df <- rbind(df,df,df,df,df)    # replicate 5X
df <- rbind(df,df,df,df,df)    # replicate 5X again (25X); total rows = 2250

pattern    <- tolower(paste(ListofCountries$name,collapse="|"))

f1 = function(df){
  df$Country <- str_trim(str_match(tolower(df$Product), pattern)[,1])
  df$Product <- str_trim(gsub(pattern, "",df$Product, ignore.case=T))
  return(df)
}

library(gsubfn)
library(tcltk2)
f2 = function(df){
  df$Country <- strapplyc(tolower(df$Product),pattern)
  df$Product <- str_trim(gsub(pattern,"", df$Product, ignore.case=T))
  return(df)
}

library(microbenchmark)
microbenchmark(df1<-f1(df),df2<-f2(df),times=10)
# Unit: seconds
#           expr      min       lq   median       uq      max neval
#  df1 <- f1(df) 1.365222 1.506017 1.611458 1.689611 1.722626    10
#  df2 <- f2(df) 2.006162 2.055963 2.148158 2.249707 2.285955    10

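Answer 1: (score: 0)

OK, back to the first question. This may not be the most efficient solution, but it does work.

The first thing I would suggest is to specify stringsAsFactors = FALSE when you initially create the CoffeeTable data frame. Otherwise you end up with factors. I also renamed the initial data column in this table to Composite, so that you can see the result of the separation.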
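As a minimal sketch of that setup step (the names toy and existing are hypothetical stand-ins, not the real CoffeeTable), the column can be created as character directly, or converted and renamed if the table already exists:

```r
# Hypothetical three-row stand-in, built with character strings from the start
toy <- data.frame(
  Composite = c("Kenya Ndumberi", "Finca Nombre de Dios", "Panama Duncan"),
  stringsAsFactors = FALSE
)

# Hypothetical table that already exists with a factor column named Product
existing <- data.frame(
  Product = c("Kenya Ndumberi", "Panama Duncan"),
  stringsAsFactors = TRUE
)
existing$Product <- as.character(existing$Product)            # factor -> character
names(existing)[names(existing) == "Product"] <- "Composite"  # rename the column
```

The conversion matters because regmatches() expects a character vector, not a factor.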

match <- gregexpr(tolower(paste(ListofCountries$name, collapse="|")),
    tolower(CoffeeTable$Composite))
CoffeeTable$Country <- sapply(regmatches(CoffeeTable$Composite, match),
    function(m) {ifelse(length(m), m, "")})
CoffeeTable$Product <- sapply(regmatches(CoffeeTable$Composite, match, invert = TRUE),
    function(m) {paste0(m, collapse = "")})

The result looks like this:

> head(CoffeeTable, 10)
              Composite Country              Product
1        Kenya Ndumberi   Kenya             Ndumberi
2        Kenya Ndumberi   Kenya             Ndumberi
3  Finca Nombre de Dios         Finca Nombre de Dios
4  Finca La Providencia         Finca La Providencia
5           Las Penidas                  Las Penidas
6           Las Penidas                  Las Penidas
7           Las Penidas                  Las Penidas
8         Panama Duncan  Panama               Duncan
9         Panama Duncan  Panama               Duncan
10        Panama Duncan  Panama               Duncan
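
To try the same idea without downloading the country list, here is a small self-contained sketch. The three-name countries vector and the composite vector are assumptions made up for illustration, and trimws() (base R 3.2.0 or later) is added here to strip the leftover whitespace; it is not part of the answer above.

```r
# Hypothetical short country list and sample strings (illustration only)
countries <- c("Kenya", "Panama", "Bolivia")
composite <- c("Kenya Ndumberi", "Finca La Providencia", "Thunguri Kenya")

# Build the pattern once and locate the matches once
pattern <- paste(tolower(countries), collapse = "|")
found   <- gregexpr(pattern, tolower(composite))

# Matched part: the first hit per string, or "" when nothing matched
country <- vapply(regmatches(composite, found),
                  function(m) if (length(m)) m[1] else "",
                  character(1))

# Non-matched part: glue the remaining pieces together, then trim whitespace
product <- trimws(vapply(regmatches(composite, found, invert = TRUE),
                         paste0, character(1), collapse = ""))

data.frame(Composite = composite, Country = country, Product = product,
           stringsAsFactors = FALSE)
```

The match positions are computed a single time and reused for both the extraction and the removal, so the pattern is only scanned once per string.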