Join on pattern matching

Date: 2017-04-12 16:18:21

Tags: r join data.table

I have the following data tables:

> measures
    source     measure
1:   my123  0.08130182
2:   123my -1.45285168
3: your123 -0.30460771
4: 123your  0.94670380
5: 12your3 -0.54728546
> sources
          name pattern
1:   My Source      my
2: Your Source    your

Created using:

library(data.table)
measures <- data.table(source=c('my123', '123my', 'your123', '123your', '12your3'), measure=rnorm(5))
sources <- data.table(name=c('My Source', 'Your Source'), pattern=c('my', 'your'))

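I would like to be able to join on like(measures.source, sources.pattern). Is there a good way to do this without cross joining and then filtering out the non-matching rows? That approach is impractical for my dataset.

I can do this in SQL (PostgreSQL, see below), but I would like to know whether there is a way to do it in R's data.table, or whether there are any plans to introduce more custom join functionality in the future.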

drop table if exists measures;
create table measures as (select * from (values
  ('my123', 0.08130182),
  ('123my', -1.45285168),
  ('your123', -0.30460771),
  ('123your', 0.94670380),
  ('your123', 0.94670380)
)t(source, measure));

drop table if exists sources;
create table sources as (select * from (values
  ('My Source',  'my'),
  ('Your Sources', 'your')
)t(name, pattern));

select * from measures join sources on measures.source ~ sources.pattern;

This returns the desired result:

source  |   measure   |     name     | pattern
--------+-------------+--------------+---------
my123   |  0.08130182 | My Source    | my
123my   | -1.45285168 | My Source    | my
your123 | -0.30460771 | Your Sources | your
123your |  0.94670380 | Your Sources | your
your123 |  0.94670380 | Your Sources | your

1 Answer:

Answer 0 (score: 0)

I'm not sure whether this counts as "impractical" or not, but it does the job. If you need more complex pattern matching, stringi handles it tidily.

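> library(stringi)   # stri_detect_regex()
> library(jsonlite)  # rbind.pages() (named rbind_pages() in newer releases)
> rbind.pages(lapply(1:nrow(measures), function(i){
       # indices of the patterns that match this row's source string
       matched_slice <- which(stri_detect_regex(measures$source[i], sources$pattern))
       data.frame(measures[i,], sources[matched_slice, ])
  }))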
   source     measure        name pattern
1   my123  0.75119183   My Source      my
2   123my  0.55344334   My Source      my
3 your123 -0.03498414 Your Source    your
4 123your  0.09364795 Your Source    your
5 12your3  0.47537732 Your Source    your
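
When there are far fewer patterns than rows, the same matching can be driven from the pattern side, so that each regex call is vectorised over the whole source column. This is only a sketch of that idea and not something I have benchmarked; the name joined_by_pattern is made up for the example, and the result comes out grouped by pattern rather than in the original row order.

```r
library(data.table)
library(stringi)

measures <- data.table(source = c('my123', '123my', 'your123', '123your', '12your3'),
                       measure = rnorm(5))
sources <- data.table(name = c('My Source', 'Your Source'), pattern = c('my', 'your'))

# One pass per pattern instead of one pass per row.
joined_by_pattern <- rbindlist(lapply(seq_len(nrow(sources)), function(j) {
    # logical vector: which source strings match pattern j
    hit <- stri_detect_regex(measures$source, sources$pattern[j])
    # attach the matching rows to a repeated copy of the j-th sources row
    data.table(measures[hit],
               name    = rep(sources$name[j], sum(hit)),
               pattern = rep(sources$pattern[j], sum(hit)))
}))
```

A row that matches several patterns appears once per matching pattern, which is the same inner-join behaviour as the SQL query in the question.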

For larger data sets, use parallel::mclapply. The data.table way of doing this:

rbindlist(lapply(1:nrow(measures), function(i){
    matched_slice <- which(stri_detect_regex(measures$source[i], sources$pattern))
    cbind(measures[i,], sources[matched_slice, ])
}))
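
For reference, a rough sketch of what the parallel::mclapply variant could look like. The core count of 2 is an arbitrary choice for the example, and the names n_cores and joined are made up here. mclapply relies on forking, which is not available on Windows, so the sketch falls back to a single core there. Rows with no matching pattern are skipped, which is an assumption about the desired inner-join behaviour.

```r
library(data.table)
library(stringi)
library(parallel)

measures <- data.table(source = c('my123', '123my', 'your123', '123your', '12your3'),
                       measure = rnorm(5))
sources <- data.table(name = c('My Source', 'Your Source'), pattern = c('my', 'your'))

# mclapply forks worker processes; Windows only supports mc.cores = 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

joined <- rbindlist(mclapply(seq_len(nrow(measures)), function(i) {
    # indices of the patterns that match this row's source string
    matched_slice <- which(stri_detect_regex(measures$source[i], sources$pattern))
    # no match: contribute nothing (rbindlist ignores NULL elements)
    if (length(matched_slice) == 0L) return(NULL)
    # the single measures row is recycled against every matching sources row
    cbind(measures[i, ], sources[matched_slice, ])
}, mc.cores = n_cores))
```

mclapply returns its results in input order, so joined keeps the row order of measures.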