Using grepl to match strings in a column against another dataset

Time: 2018-07-19 10:16:38

Tags: r regex left-join grepl

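I have two Excel files, and I want to match two strings that refer to the same company: one in column 2 of the first dataset and one in column 1 of the second Excel file. In this case, for example, BPET LIMITED and BPET LTD. The Excel files look like this: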

**ywOExport22** Company name   "year"      X    Y   Z 
1.  BLAFARMERS LIMITED          2017    1234    1   5
2.  COTTONBALLS GROUP LIMITED   2017    1254    2   8
3.  RIO JANEIRO LIMITED         2017    5233    
4.  BPET LIMITED                2017    6954    7   2
5.  TELOPSTRA CORPORATION       2017    4569    5   1

**X20131403** Name         ABN      Income $         Taxable $
21ST AGE HOLDINGS PTY LTD  555454   464         
A.C.N.A.BPTY LIMITED       546546   5553            
ABBA HOLDINGS PTY LTD      455564   56               54646  
BPET LTD                   546454   6546             44545  
ACCOLADE  PTY LIMITED      464651   5456        

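I would like to create a match column in both Excel files, "fuzzy match" one column against the other, and then left-join one file onto the other using that match. I tried the following code: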

X20131403$match <- 0
ywOExport22$match <- 0

ywOExport22$match <- mapply(grepl(ywOExport22[,2], X20131403[,1], ignore.case = TRUE, perl = FALSE, fixed = FALSE, useBytes = FALSE))

X20131403$match <- X20131403[,1]
ywOExport22 <- left_join(ywOExport22, X20131403, by="match")

Output:

> ywOExport22$match <- mapply(grepl(ywOExport22[,2], X20131403[,1], ignore.case = TRUE, perl = FALSE,
+                                       fixed = FALSE, useBytes = FALSE))
Error in match.fun(FUN) : 
  c("'grepl(ywOExport22[, 2], X20131403[, 1], ignore.case = TRUE, ' ist nicht Funktion, Zeichen oder Symbol", "'    perl = FALSE, fixed = FALSE, useBytes = FALSE)' ist nicht Funktion, Zeichen oder Symbol")
In addition: Warning message:
In grepl(ywOExport22[, 2], X20131403[, 1], ignore.case = TRUE,  :
  argument 'pattern' has length > 1 and only the first element will be used
> 
> X20131403$match <- X20131403[,1]
> ywOExport22 <- left_join(ywOExport22, X20131403, by="match")
Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) : 
  Can't join on 'match' x 'match' because of incompatible types (character / numeric)

Desired output:

Company name               MATCH    ABN        Income $ Taxable$
BLAFARMERS LIMITED              
COTTONBALLS GROUP LIMITED               
RIO JANEIRO LIMITED             
BPET LIMITED               BPET LTD 5464545452  65466   445
TELOPSTRA CORP LIMITED      

Any suggestions on how to fix my code?

1 answer:

Answer 0: (score: 0)

set.seed(101)

firstSet <- data.frame(
  Company = c('BLAFARMERS LIMITED', 'COTTONBALLS GROUP LIMITED', 
              'RIO JANEIRO LIMITED', 'BPET LIMITED',
              'TELOPSTRA CORPORATION'),
  Year = rep(2017, times = 5),
  X = runif(5)
)

secondSet <- data.frame(
  Company = c('ST AGE HOLDINGS PTY LTD', 'A.C.N.A.BPTY LIMITED', 
              'ABBA HOLDINGS PTY LTD', 'BPET LTD',
              'ACCOLADE PTY LIMITED'),
  Income = floor(runif(5, 0, 100))
)

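# Keep the original spelling so it can be reported in the MATCH column
secondSet$MATCH <- secondSet$Company
# Expand the abbreviation so both data sets share the same join key
secondSet$Company <- gsub(
  pattern = 'LTD', 
  replacement = 'LIMITED', 
  x = secondSet$Company)

# Inner join on the normalised company name
merge(firstSet, secondSet, by = 'Company')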

# output 
#       Company Year         X Income    MATCH
# 1 BPET LIMITED 2017 0.6576904     62 BPET LTD
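
For reference, the error messages in the question point to three separate problems: `mapply` expects a function as its first argument but was handed the result of `grepl(...)`; `grepl` uses only the first element of a pattern vector; and the `match` column was initialised to the number 0 in one data frame and to character names in the other, so `left_join` refused to join them. If you would rather stay with `grepl`, the following is a minimal sketch under some assumptions: the helper names `company_stem` and `find_match` are hypothetical, the suffix list (LIMITED, LTD, CORPORATION) is only a guess based on the sample rows, and the data frames are a cut-down copy of the tables in the question.

```r
# Toy data copied from the question; only the columns needed here.
firstSet <- data.frame(
  Company = c("BLAFARMERS LIMITED", "COTTONBALLS GROUP LIMITED",
              "RIO JANEIRO LIMITED", "BPET LIMITED", "TELOPSTRA CORPORATION"),
  stringsAsFactors = FALSE
)
secondSet <- data.frame(
  Name = c("21ST AGE HOLDINGS PTY LTD", "A.C.N.A.BPTY LIMITED",
           "ABBA HOLDINGS PTY LTD", "BPET LTD", "ACCOLADE PTY LIMITED"),
  ABN = c(555454, 546546, 455564, 546454, 464651),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a trailing legal suffix to get a comparable stem.
# The suffix list is an assumption and will need extending for real data.
company_stem <- function(x) {
  trimws(sub("\\s+(LIMITED|LTD|CORPORATION)$", "", toupper(x), perl = TRUE))
}

# Hypothetical helper: grepl() takes a single pattern, so it is called once
# per stem and returns the first candidate containing that stem (or NA).
find_match <- function(stem, candidates) {
  hits <- grepl(stem, candidates, fixed = TRUE)
  if (any(hits)) candidates[which(hits)[1]] else NA_character_
}

# vapply() loops over the stems and guarantees a character result.
firstSet$match <- vapply(company_stem(firstSet$Company), find_match,
                         character(1), candidates = secondSet$Name,
                         USE.NAMES = FALSE)
print(firstSet)
```

Because this is plain substring matching, a short stem can also hit unrelated longer names, so it is worth checking the matches by eye. Base R's `agrep` and `adist` offer approximate (edit-distance) matching if substring matching turns out to be too strict or too loose.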

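The `merge` call is easy to modify so that the unmatched rows also appear in the output (with empty values), for example by passing `all.x = TRUE`.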
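
A minimal sketch of that left-join variant, assuming the same LTD to LIMITED normalisation as above. The data frames are re-declared so the snippet runs on its own, the Income figures are copied from the question's second table, and the word-boundary pattern `\\bLTD\\b` is my own tightening rather than part of the original answer.

```r
firstSet <- data.frame(
  Company = c("BLAFARMERS LIMITED", "COTTONBALLS GROUP LIMITED",
              "RIO JANEIRO LIMITED", "BPET LIMITED", "TELOPSTRA CORPORATION"),
  Year = 2017,
  stringsAsFactors = FALSE
)
secondSet <- data.frame(
  Company = c("21ST AGE HOLDINGS PTY LTD", "A.C.N.A.BPTY LIMITED",
              "ABBA HOLDINGS PTY LTD", "BPET LTD", "ACCOLADE PTY LIMITED"),
  Income = c(464, 5553, 56, 6546, 5456),
  stringsAsFactors = FALSE
)

# Keep the original name, then normalise the join key.
secondSet$MATCH <- secondSet$Company
secondSet$Company <- gsub("\\bLTD\\b", "LIMITED", secondSet$Company, perl = TRUE)

# all.x = TRUE keeps every row of firstSet; unmatched rows get NA.
joined <- merge(firstSet, secondSet, by = "Company", all.x = TRUE)
print(joined)
```

Note that `merge` sorts the result by the key column by default, so the rows come back in alphabetical order of `Company` rather than in the original order. With dplyr loaded, `left_join(firstSet, secondSet, by = "Company")` should give the same left join while keeping the row order of `firstSet`.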