How to find specific words in a string and merge variables by those words

Time: 2018-09-28 06:57:07

Tags: r

Suppose I have a dataset "A", for example:

Disease_name
(J189)Pneumonia, unspecified
(R51)Headache
(M4806)Spinal stenosis, lumbar region
(M512)Other specified intervertebral disc displacement
(C187)Sigmoid colon
(N201)Calculus of ureter
(C189)Colon, unspecified
(S0600)Concussion, without open intracranial wound
(C73)Malignant neoplasm of thyroid gland
(C509)Breast, unspecified
(K746)Other and unspecified cirrhosis of liver
(B181)Chronic viral hepatitis B without delta- agent
(R42)Dizziness and giddiness

and another dataset B that looks like this:

parts         key
Chest       pneumonia
Head        headache
Abdominal   spinal
Abdominal   intervertebral
Abdominal   colon
Abdominal   ureter
Abdominal   colon
Head        concussion
Neck        thyroid
Chest       breast
Abdominal   liver
Abdominal   hepatitis
Head        giddiness

I want to find the words from B$key in A$Disease_name and merge A with B on the matched keywords, so that B$parts is assigned to each A$Disease_name.

How can I do this in R?

2 answers:

Answer 0 (score: 1)

Welcome! The question is clear to me. Here is a tidyverse solution.

First, read in some data:

library(dplyr)

tmp <- data.table::fread(
"Disease_name
(J189)Pneumonia, unspecified
(R51)Headache
(M4806)Spinal stenosis, lumbar region
(M512)Other specified intervertebral disc displacement
(C187)Sigmoid colon
(N201)Calculus of ureter
(C189)Colon, unspecified
(S0600)Concussion, without open intracranial wound
(C73)Malignant neoplasm of thyroid gland
(C509)Breast, unspecified
(K746)Other and unspecified cirrhosis of liver
(B181)Chronic viral hepatitis B without delta- agent
(R42)Dizziness and giddiness",
sep = ""
)


tmp2 <- data.table::fread(
  "parts  key
Chest   pneumonia
Head    headache
Abdominal   spinal
Abdominal   intervertebral
Abdominal   colon
Abdominal   ureter
Abdominal   colon
Head    concussion
Neck    thyroid
Chest   breast
Abdominal   liver
Abdominal   hepatitis
Head    giddiness"
)

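Then we do the join. The regular expression combines every key into one alternation and replaces each lower-cased disease name with the key it contains: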

result <-
  tmp %>%
  mutate(key = gsub(paste0(".*(", paste(tmp2$key, collapse = "|"), ").*"),
  "\\1",
  tolower(tmp$Disease_name))) %>%
  left_join(tmp2)
#> Joining, by = "key"

The result (rows 5-6 and 8-9 are repeated because colon appears twice in tmp2):

result
#>                                              Disease_name            key
#> 1                            (J189)Pneumonia, unspecified      pneumonia
#> 2                                           (R51)Headache       headache
#> 3                   (M4806)Spinal stenosis, lumbar region         spinal
#> 4  (M512)Other specified intervertebral disc displacement intervertebral
#> 5                                     (C187)Sigmoid colon          colon
#> 6                                     (C187)Sigmoid colon          colon
#> 7                                (N201)Calculus of ureter         ureter
#> 8                                (C189)Colon, unspecified          colon
#> 9                                (C189)Colon, unspecified          colon
#> 10     (S0600)Concussion, without open intracranial wound     concussion
#> 11               (C73)Malignant neoplasm of thyroid gland        thyroid
#> 12                              (C509)Breast, unspecified         breast
#> 13         (K746)Other and unspecified cirrhosis of liver          liver
#> 14   (B181)Chronic viral hepatitis B without delta- agent      hepatitis
#> 15                           (R42)Dizziness and giddiness      giddiness
#>        parts
#> 1      Chest
#> 2       Head
#> 3  Abdominal
#> 4  Abdominal
#> 5  Abdominal
#> 6  Abdominal
#> 7  Abdominal
#> 8  Abdominal
#> 9  Abdominal
#> 10      Head
#> 11      Neck
#> 12     Chest
#> 13 Abdominal
#> 14 Abdominal
#> 15      Head

Created on 2018-09-28 by the reprex package (v0.2.1)
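If the repeated rows are not wanted, one option is to de-duplicate the lookup table before joining. The following is only a minimal sketch under assumptions: the data frames `diseases` and `lookup` are small hypothetical stand-ins for tmp and tmp2, and `lookup_unique` and `deduped` are names introduced here for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for tmp and tmp2 (a small subset, for illustration only)
diseases <- data.frame(
  Disease_name = c("(C187)Sigmoid colon", "(C189)Colon, unspecified", "(R51)Headache"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  parts = c("Abdominal", "Abdominal", "Head"),
  key   = c("colon", "colon", "headache"),
  stringsAsFactors = FALSE
)

# Drop repeated (parts, key) pairs so each key matches at most one lookup row
lookup_unique <- distinct(lookup)

# One alternation pattern built from the unique keys, e.g. "colon|headache"
pattern <- paste(lookup_unique$key, collapse = "|")

deduped <- diseases %>%
  # Replace each lower-cased name with the key it contains
  mutate(key = gsub(paste0(".*(", pattern, ").*"), "\\1", tolower(Disease_name))) %>%
  # Naming the join column explicitly avoids the "Joining, by" message
  left_join(lookup_unique, by = "key")
```

Note that when a disease name contains none of the keys, `gsub()` leaves the whole lower-cased name in `key`, so the join yields `NA` in `parts` for that row.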

Answer 1 (score: 1)

Assuming you have two data frames A and B, you can use sqldf to merge them into a new data frame C, as follows:

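library(sqldf)
# sqldf uses SQLite by default: instr() replaces LOCATE(), lower() ignores case
C <- sqldf("SELECT B.parts, A.Disease_name
            FROM A, B
            WHERE instr(lower(A.Disease_name), B.key) > 0")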

I don't currently have access to a working R/sqldf environment, so you may still need to adjust the SQL statement.
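The same idea (pair every row of A with every row of B, then keep the pairs where the key occurs in the disease name) can also be sketched in base R, which may help when checking the SQL result. This is an illustration under assumptions rather than the answerer's method: the two small data frames are hypothetical stand-ins for A and B, and `pairs` and `hit` are names introduced here.

```r
# Hypothetical stand-ins for A and B (a small subset, for illustration only)
A <- data.frame(
  Disease_name = c("(J189)Pneumonia, unspecified", "(R51)Headache"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  parts = c("Chest", "Head"),
  key   = c("pneumonia", "headache"),
  stringsAsFactors = FALSE
)

# by = NULL makes merge() return the Cartesian product of A and B
pairs <- merge(A, B, by = NULL)

# For each pair, test whether the key occurs literally in the lower-cased name
hit <- unname(mapply(grepl, pairs$key, tolower(pairs$Disease_name),
                     MoreArgs = list(fixed = TRUE)))

# Keep only the matching pairs and the two columns selected in the SQL
C <- pairs[hit, c("parts", "Disease_name")]
```

The Cartesian product grows as rows of A times rows of B, so this suits small lookup tables like the one in the question.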