R: Using grep to find one or more matches in order of importance

Time: 2017-04-06 11:29:32

Tags: r string grep

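I am using grep to tidy up some address data. Specifically, my goal is to identify the street/avenue/road type in a given record and column. Each address has already been split on spaces into single words in the variable tempval, for example: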

R > tempval
[1] "38"   "WILLOW" "PARK"  

I use the following statement to find out which of the words that can follow a street name is present:

  stID <- grep("STREET|\\bST\\b|AVENUE|\\bAVE\\b|\\bAV\\b|WAY|BOULEVARD|\\bBD\\b|ROAD|\\bRD\\b|PLACE|\\bPL\\b|ESPLANADE|TERRACE|PARADE|DRIVE|\\bDR\\b|\\bPARK\\b|LANE|CRESCENT|\\bCOURT\\b|\\bCRES\\b", tempval, ignore.case = TRUE)

R > stID
[1] 3

This is fine: I know that "PARK" is the 3rd element, and that whatever comes before it is my street number and name.

However, a problem arises when there are multiple matches, i.e. length(stID) > 1, for example:

R > tempval
[1] "38"   "PARK" "ST" 

So here I get

R > stID
[1] 2 3

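How can I get R to return a single match in order of importance (the order in which I put the strings in the grep pattern)? In other words, if R finds both "ST" and "PARK", then "ST" is more important than "PARK", so only stID = 3 should be returned.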

2 Answers:

Answer 0: (score: 3)

Using grep here is rather risky: even with priorities, your grep would return "streetlife" in "streetlife park", because it finds "street" inside "streetlife".
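To make the substring problem concrete, here is a small illustration. The input vector `words` is made up for this example and is not taken from the question.

```r
# Hypothetical input: "STREETLIFE" is part of the street name, "PARK" is the type.
words <- c("38", "STREETLIFE", "PARK")

# Without word boundaries, "STREET" also matches inside "STREETLIFE".
loose <- grep("STREET|\\bPARK\\b", words, ignore.case = TRUE)

# With word boundaries on both alternatives, only "PARK" matches.
strict <- grep("\\bSTREET\\b|\\bPARK\\b", words, ignore.case = TRUE)

print(loose)   # positions 2 and 3
print(strict)  # position 3 only
```

Adding \\b around every alternative avoids this particular false hit, but it still does not impose any priority between the remaining matches.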

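So I would suggest using match instead. Convert everything to lower case and use a vector whose values are listed in order of importance. You can then use match to see where in x each value of that vector occurs. All that is left is to take the first value that is not NA, and you are done: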

checkstreet <- function(x){
  x <- tolower(x)
  thenames <- c("street","st","avenue","ave","av",
                "way","boulevard", "bd", "road", "rd",
                "place", "pl", "esplanade","terrace","parade",
                "drive","dr","park","lane","crescent","court",
                "cres")

  id <- match(thenames, x)
  id[!is.na(id)][1]
}

This gives:

> tmpval <- c("38","park","street")
> checkstreet(tmpval)
[1] 3
> tmpval <- c("44","Average","Esplanade")
> checkstreet(tmpval)
[1] 3
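Once the index of the street type is known, the elements before it can be taken as the street number and name, as described in the question. The following is only a minimal sketch: it assumes the first element is always the street number, the helper `splitAddress` and its output field names are hypothetical, and the list of names is shortened for brevity.

```r
# Shortened priority list for illustration (most important first).
thenames <- c("street", "st", "avenue", "ave", "road", "rd", "park")

# Hypothetical helper: split an address, given as a vector of words, into parts.
splitAddress <- function(x) {
  id <- match(thenames, tolower(x))      # position in x of each name, NA if absent
  id <- id[!is.na(id)][1]                # position of the most important name found
  if (is.na(id) || id < 3) return(NULL)  # no street type, or nothing usable before it
  list(number = x[1],
       name   = paste(x[2:(id - 1)], collapse = " "),
       type   = x[id])
}

res <- splitAddress(c("38", "PARK", "ST"))
print(res)
```

Returning NULL when no street type is found is a choice made for this sketch; returning NA or raising an error may suit other pipelines better.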

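If you insist on using grep and keep \\b as the word boundary, you can apply the same logic, but this time with which.min:

checkstreet <- function(x){
  x <- tolower(x)
  thenames <- c("street","st","avenue","ave","av",
                "way","boulevard", "bd", "road", "rd",
                "place", "pl", "esplanade","terrace","parade",
                "drive","dr","park","lane","crescent","court",
                "cres")

  # For each word in x: index of the most important name it matches as a
  # whole word, or NA if it matches none.
  rank <- vapply(x, function(word) {
    hits <- which(vapply(paste0("\\b", thenames, "\\b"), grepl,
                         logical(1), x = word))
    if (length(hits) > 0) hits[[1]] else NA_integer_
  }, integer(1), USE.NAMES = FALSE)

  # Position of the word with the most important match (integer(0) if none).
  which.min(rank)
}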

Answer 1: (score: 1)

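You can do this by matching each search term separately in a loop and then scoring the matches, so that terms appearing earlier in the search list get a higher score: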

## Vector of search terms:
matchVec <- strsplit("STREET|\\bST\\b|AVENUE|\\bAVE\\b|\\bAV\\b|WAY|BOULEVARD|\\bBD\\b|ROAD|\\bRD\\b|PLACE|\\bPL\\b|ESPLANADE|TERRACE|PARADE|DRIVE|\\bDR\\b|\\bPARK\\b|LANE|CRESCENT|\\bCOURT\\b|\\bCRES\\b", "\\|")[[1]]

## Function to determine score of the match:
scoreMatch <- function(myString, matchVec){
    ## Position of matches in the search list:
    position <- which(vapply(matchVec, function(matchStr) grepl(pattern = matchStr, x = myString), 
                    logical(1)))
    ## Score: First search term gets the highest score, second gets second 
    ## highest score etc. No match = score 0:
    if (length(position) > 0) length(matchVec) - position[[1]] + 1 else 0
}

## Determine score of each element/word in your vector:
scoreVec <- vapply(tempval, function(x) scoreMatch(x, matchVec), numeric(1))

## Find index with the highest score:
stID <- which.max(scoreVec)
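
The same priority idea can also be written without scores: try each pattern in order and stop at the first one that matches anywhere in the vector. This is a sketch under assumptions, not part of the answer above: the function name `firstByPriority` is hypothetical, the pattern list is shortened, and returning NA when nothing matches is a choice made here.

```r
# Shortened pattern list, most important first.
patterns <- c("\\bSTREET\\b", "\\bST\\b", "\\bAVENUE\\b", "\\bROAD\\b", "\\bPARK\\b")

# Hypothetical helper: index of the element matching the most important pattern.
firstByPriority <- function(x, patterns) {
  for (p in patterns) {
    hit <- grep(p, x, ignore.case = TRUE)
    if (length(hit) > 0) return(hit[1])  # first occurrence of this pattern
  }
  NA_integer_                            # nothing matched at all
}

stID <- firstByPriority(c("38", "PARK", "ST"), patterns)
print(stID)
```

Because the loop returns as soon as a pattern matches, "ST" wins over "PARK" simply by appearing earlier in `patterns`.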