Partial grepl results when matching keywords against text strings in multiple columns

Time: 2018-04-13 17:35:13

Tags: r grep

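I have a list of diagnoses that I want to group by keyword. If one of the keywords in ref[[1]] is found in mh$prb, then mh$group should get 1. My problem with grepl is that some of my keywords match while others do not, even though they are present. The keywords are stored in ref (its dput output is in the edit below).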

To assign the diagnosis groups, I ran the following, based on this example:

mh$group <- ifelse(grepl(ref[[1]], mh$prb), 1, 
                   ifelse(grepl(ref[[2]], mh$prb), 2,
                          ifelse(grepl(ref[[3]], mh$prb), 3,
                                 ifelse(grepl(ref[[4]], mh$prb), 4,
                                        ifelse(grepl(ref[[5]], mh$prb), 5,
                                               ifelse(grepl(ref[[6]], mh$prb), 6,
                                                      ifelse(grepl(ref[[7]], mh$prb), 7, 0
                                 )))))))

As you can see, I get a partial match: some keywords are flagged and others are not. For example, "depression" is assigned a group, while "bipolar" is not.

> head(mh)
  prb                                             group
  <chr>                                           <dbl>
1 unspecified major depression  single episode     2.00
2 bipolar disorder  unspecified                    0   
3 unspecified major depression  recurrent episode  2.00
4 bipolar disorder unspecified                     0   
5 alcohol abuse unspecified                        7.00
6 cocaine dependence  uncomplicated                0

So I isolated a test case. You can see that the t data frame contains "bipolar", and so does ref.

> t <- filter(mh, prb == "bipolar disorder  unspecified")
> ref[[2]]
[1] "major| depression| depressive| bipolar| manic| mood| substance induced mood| substance induced mood| alcohol induced mood| alcohol induced mood| cocaine induced mood| cocaine induced mood| amphetamine induced mood| amphetamine induced mood| opioid induced mood| opioid induced mood| cannabis induced mood| cannabis induced mood| marijuana induced mood| marijuana induced mood| methamphetamine induced mood| methamphetamine induced mood| sedative| hypnotic anxiolytic induced mood"
> grepl("bipolar", t$prb)
[1] TRUE
> grepl("bipolar", ref[[2]])
[1] TRUE
> grepl(t$prb, ref[[2]])
[1] FALSE
> grepl(ref[[2]], t$prb)
[1] FALSE

So "bipolar" is TRUE for ref[[2]] and for t$prb separately, but not when the two are compared with each other. Where did I go wrong?

Edit

> dput(ref)
c("psychotic| schizophrenia| schizo| psychosis| delusional| delusion| paranoid| undifferentiated| disorganized| substance induced psychotic| substance induced psychosis| alcohol induced psychotic| alcohol induced psychosis| cocaine induced psychosis| cocaine induced psychotic| amphetamine induced psychosis| amphetamine induced psychotic| opioid induced psychosis| opioid induced psychotic| cannabis induced psychosis| cannabis induced psychotic| marijuana induced psychosis| marijuana induced psychotic| methamphetamine induced psychosis| methamphetamine induced psychotic| hallucinogen induced psychosis| hallucinogen induced psychotic| PCP induced psychosis| PCP induced psychotic| benzodiazepine induced psychosis| benzodiazepine induced psychotic| phencyclidine induced psychosis| phencyclidine induced psychotic", 
"major| depression| depressive| bipolar| manic| mood| substance induced mood| substance induced mood| alcohol induced mood| alcohol induced mood| cocaine induced mood| cocaine induced mood| amphetamine induced mood| amphetamine induced mood| opioid induced mood| opioid induced mood| cannabis induced mood| cannabis induced mood| marijuana induced mood| marijuana induced mood| methamphetamine induced mood| methamphetamine induced mood| sedative| hypnotic anxiolytic induced mood", 
"post| traumatic| PTSD| panic| intermittent| explosive", "borderline| schizoid| schizotypal| paranoid", 
"neuro| neurocognitive| cognitive| dementia| alzheimers| vascular", 
"autism| aspergers| spectrum| retardation| intellectual| disability", 
"alcohol| cannabis| marijuana| opioid| heroin| amphetamine| methamphetamine| cocaine| inhalant| hallucinogen| PCP| sedative| hypnotic| anxiolytic| benzodiazepine| Xanax| valium| phencyclidine| induced| substance induced| alcohol induced| cannabis induced| marijuana induced| opioid induced| heroin induced| amphetamine induced| methamphetamine induced| cocaine induced| inhalant induced| hallucinogen induced| PCP induced| sedative induced| hypnotic induced| anxiolytic induced| benzodiazepine induced| Xanax induced| valium induced| phencyclidine induced"
)

> dput(head(mh))
structure(list(prb = c("unspecified major depression  single episode", 
"bipolar disorder  unspecified", "unspecified major depression  recurrent episode", 
"bipolar disorder unspecified", "alcohol abuse unspecified", 
"cocaine dependence  uncomplicated")), .Names = "prb", row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

1 Answer:

Answer 0: (score: 1)

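The problem is caused by the way your ref variable is defined. When you specify "| bipolar", grepl looks for a space followed by the word "bipolar", so you miss every match where the term is the first word of the string. To fix it, try "|bipolar" (which also finds the term inside compound words) or "|bipolar " (which finds it as a separate word, except when it is the last word in the string).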
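The effect of the leading space can be reproduced on a small scale. This is only a sketch: `pattern_with_space`, `pattern_no_space`, and `x` are hypothetical stand-ins, not the real ref and mh$prb.

```r
# Hypothetical two-keyword pattern written the way ref is written (space after "|")
pattern_with_space <- "major| bipolar"
# The same pattern without the space
pattern_no_space <- "major|bipolar"

# "bipolar" as the first word, and "bipolar" preceded by a space
x <- c("bipolar disorder  unspecified", "unspecified bipolar disorder")

# " bipolar" needs a space before the word, so the first string cannot match
match_with_space <- grepl(pattern_with_space, x)
# Without the space, both strings match
match_no_space <- grepl(pattern_no_space, x)

print(match_with_space)
print(match_no_space)
```

Only the first alternative in each ref element ("major", "alcohol", and so on) has no leading space, which would explain why those keywords matched at the start of a string while the others did not.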

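To fix the ref variable in bulk, without manually deleting all the extra spaces, you can use gsub. The | is a special character in a regular expression, so it needs a double escape.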

ref <- gsub("\\| ", "\\|", ref)

# For example
ref[5]

[1] "neuro|neurocognitive|cognitive|dementia|alzheimers|vascular"
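As a side note on the escaping, a possible alternative is to pass `fixed = TRUE` to gsub, which treats the pattern as a literal string so that "|" needs no escape. The sketch below uses a hypothetical `ref_demo` string rather than the real ref, and checks that both forms give the same result.

```r
# Hypothetical stand-in for one element of ref
ref_demo <- "neuro| cognitive| dementia"

# Regex form: "|" is a metacharacter, so it is escaped with a double backslash
cleaned_regex <- gsub("\\| ", "|", ref_demo)

# Literal form: with fixed = TRUE the pattern is matched as-is, no escaping needed
cleaned_fixed <- gsub("| ", "|", ref_demo, fixed = TRUE)

print(cleaned_regex)
print(identical(cleaned_regex, cleaned_fixed))
```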

Now:

ifelse(grepl(ref[[1]], mh$prb), 1,....  )))))))

will produce:

[1] 2 2 2 2 7 7
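If the nested ifelse becomes hard to read, one possible alternative is a short loop over the patterns. This is a sketch under assumptions, not part of the original answer: `assign_group`, `patterns_demo`, and `prb_demo` are hypothetical names, and the patterns are shortened stand-ins for the cleaned ref.

```r
# Hypothetical helper: index of the first matching pattern, or 0 if none match
assign_group <- function(text, patterns) {
  group <- rep(0, length(text))
  # Walk the patterns in reverse so that earlier patterns overwrite later ones,
  # which reproduces the priority order of the nested ifelse()
  for (i in rev(seq_along(patterns))) {
    group[grepl(patterns[[i]], text)] <- i
  }
  group
}

# Shortened stand-ins for the cleaned ref and for mh$prb
patterns_demo <- c("psychotic|schizo", "major|depression|bipolar", "alcohol|cocaine")
prb_demo <- c("unspecified major depression  single episode",
              "bipolar disorder  unspecified",
              "alcohol abuse unspecified",
              "adjustment disorder")

groups_demo <- assign_group(prb_demo, patterns_demo)
print(groups_demo)
```

With the real data the call would look like `mh$group <- assign_group(mh$prb, ref)`, and it keeps working if groups are added to or removed from ref.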