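I have a very long list of strings that share substrings. The list comes from event-stream data, so it has tens of thousands of rows, but I will simplify it for this example to pets: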
+--------------------------------+
| Pets |
+--------------------------------+
| "one calico cat that's smart" |
| "German Shepard dog" |
| "A Chameleon that is a Lizard" |
| "a cute tabby cat" |
| "the fish guppy" |
| "Lizard Gecko" |
| "German Shepard dog" |
| "Budgie Bird" |
| "Canary Bird in a coal mine" |
| "a chihuahua dog" |
+--------------------------------+
dput output: structure(list(Pets = structure(c(8L, 6L, 1L, 3L, 9L, 7L, 6L, 4L, 5L, 2L),.Label = c("A Chameleon that is a Lizard", "a chihuahua dog", "a cute tabby cat", "Budgie Bird", "Canary Bird in a coal mine", "German Shepard dog", "Lizard Gecko", "one calico cat that's smart", "the fish guppy"), class = "factor")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame")
I want to add information based on the generic type of pet (dog, cat, etc.), and I have a key table that contains this information:
+----------+----------------+
| key | classification |
+----------+----------------+
| "dog" | "canine" |
| "cat" | "feline" |
| "lizard" | "reptile" |
| "bird" | "avian" |
| "fish" | "fish" |
+----------+----------------+
dput output: structure(list(key = structure(c(3L, 2L, 5L, 1L, 4L), .Label = c("bird", "cat", "dog", "fish", "lizard"), class = "factor"), classification = structure(c(2L, 3L, 5L, 1L, 4L), .Label = c("avian", "canine", "feline", "fish", "reptile"), class = "factor")), .Names = c("key", "classification"), row.names = c(NA, -5L), class = "data.frame")
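How can I use the "long strings" in the Pets table to look up the relevant classification in the key table? The problem is that my lookup strings contain the substrings found in the key table, not the other way around.
I started with grepl like this: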
key[grepl(pets[1,1], key[ , 2]), ]
But this does not work, because "calico cat" is not in the key list, even though "cat" is. The result I am looking for would be "feline".
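To make the failure concrete, here is a minimal base R sketch of the argument order. The vector keys and the string pet are stand-ins for the two tables above, not the real event-stream data.

```r
# Stand-ins for the key column and one row of the Pets table (assumed names).
keys <- c("dog", "cat", "lizard", "bird", "fish")
pet <- "one calico cat that's smart"

# Long string used as the pattern: no key contains it, so every result is FALSE.
as_pattern <- grepl(pet, keys, fixed = TRUE)

# Each key used as the pattern: only "cat" occurs inside the long string.
found <- vapply(keys, function(k) grepl(k, pet, ignore.case = TRUE), logical(1))
keys[found]
```

In other words, the key has to be the pattern and the long string has to be the text being searched.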
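(Note: I can't simply switch the arguments around, because in my own code this sits inside an apply function that loops over every row of the data. So instead of pets[1,1] it is pets[n,1]. In the end, I intend to attach the result to the event-stream data for further analysis.)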
I can't wrap my head around how to do this. Any suggestions?
Answer 0 (score: 2)
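You can do this kind of thing easily with the fuzzyjoin package.
Here you can use regex_left_join, which works like a normal left join (e.g. dplyr::left_join), except that the criterion for matching rows is a regular-expression match rather than equality: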
library(tibble)
library(fuzzyjoin)
pets <- tribble(
~pets,
"one calico cat that\'s smart",
"German Shepard dog",
"A Chameleon that is a Lizard",
"a cute tabby cat",
"the fish guppy",
"Lizard Gecko",
"German Shepard dog",
"Budgie Bird",
"Canary Bird in a coal mine",
"a chihuahua dog"
)
key <- tribble(
~key, ~classification,
"dog", "canine",
"cat", "feline",
"lizard", "reptile",
"bird", "avian",
"fish", "fish"
)
regex_left_join(pets, key, by = c("pets" = "key"), ignore_case = TRUE)
#> # A tibble: 10 x 3
#> pets key classification
#> <chr> <chr> <chr>
#> 1 one calico cat that's smart cat feline
#> 2 German Shepard dog dog canine
#> 3 A Chameleon that is a Lizard lizard reptile
#> 4 a cute tabby cat cat feline
#> 5 the fish guppy fish fish
#> 6 Lizard Gecko lizard reptile
#> 7 German Shepard dog dog canine
#> 8 Budgie Bird bird avian
#> 9 Canary Bird in a coal mine bird avian
#> 10 a chihuahua dog dog canine
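For readers without fuzzyjoin installed, the same idea can be sketched in base R. This is only an illustration of a regex-based left join, not how fuzzyjoin implements it. The extra row "a pet rock" is a made-up example of a string that matches no key, and when several keys match, this sketch keeps only the first one.

```r
# Small stand-in tables; "a pet rock" is a hypothetical unmatched row.
pets <- data.frame(pets = c("one calico cat that's smart", "German Shepard dog",
                            "Lizard Gecko", "a pet rock"),
                   stringsAsFactors = FALSE)
key <- data.frame(key = c("dog", "cat", "lizard"),
                  classification = c("canine", "feline", "reptile"),
                  stringsAsFactors = FALSE)

# hits[i, j] is TRUE when key j matches pet i, ignoring case.
hits <- sapply(key$key, function(k) grepl(k, pets$pets, ignore.case = TRUE))

# Row index of the first matching key per pet, or NA when nothing matches.
first_hit <- apply(hits, 1, function(row) if (any(row)) which(row)[1] else NA_integer_)

# Indexing with NA yields a row of NAs, which gives left-join behaviour.
result <- cbind(pets, key[first_hit, , drop = FALSE])
rownames(result) <- NULL
result
```

Unmatched rows keep their pets value and get NA in the key and classification columns.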
Answer 1 (score: 1)
You can extract the matching key from each pet string and then look it up in the table:
Pattern = paste(KeyTable$key, collapse="|")
Pattern = paste0(".*(", Pattern, ").*")
Type = tolower(sub(Pattern, "\\1", ignore.case=TRUE, Pets))
KeyTable$classification[match(Type, KeyTable$key)]
[1] "feline" "canine" "reptile" "feline" "feline" "canine" "fish"
[8] "reptile" "canine" "avian" "avian" "canine"
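It may help to see the intermediate values. The sketch below uses a shortened, made-up set of keys and strings (keys, x and "pet rock" are illustrative, not part of the answer's data) to show what the pattern looks like and what happens when no key matches.

```r
# Hypothetical small inputs for illustration.
keys <- c("dog", "cat", "lizard")
x <- c("tabby cat", "Gecko Lizard", "pet rock")

# All keys joined into one alternation, wrapped so the whole string is consumed.
pattern <- paste0(".*(", paste(keys, collapse = "|"), ").*")
pattern

# sub() replaces the whole string with the captured key; a string with
# no match is returned unchanged.
type <- tolower(sub(pattern, "\\1", x, ignore.case = TRUE))

# match() then returns NA for anything that is not a key.
idx <- match(type, keys)
```

Because sub() leaves non-matching strings untouched, the NA from match() is what flags a pet with no known key.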
Data
KeyTable = read.table(text="key classification
dog canine
cat feline
lizard reptile
bird avian
fish fish",
header=TRUE, stringsAsFactors=FALSE)
Pets = c("calico cat",
"Shepard dog" ,
"Chameleon Lizard",
"calico cat",
"tabby cat",
"chihuahua dog",
"guppy fish",
"Gecko Lizard",
"Shepard dog",
"Budgie Bird",
"Canary Bird" ,
"chihuahua dog")
Answer 2 (score: 1)
Here is another approach using hashmap:
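library(dplyr)    # %>%, mutate(), select(), bind_cols()
library(tidyr)    # separate_rows()
library(hashmap)
# Map each key to its classification.
hash_table = hashmap(Lookup$key, Lookup$classification)
Pets %>%
  separate_rows(Pets, sep = " ") %>%               # one word per row
  mutate(class = hash_table[[tolower(Pets)]]) %>%  # NA for words that are not keys
  na.omit() %>%                                    # keep only the key words
  select(Key = Pets, class) %>%
  bind_cols(Pets, .)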
Result:
> hash_table
## (character) => (character)
## [fish] => [fish]
## [bird] => [avian]
## [lizard] => [reptile]
## [cat] => [feline]
## [dog] => [canine]
Pets Key class
1 one calico cat that's smart cat feline
2 German Shepard dog dog canine
3 A Chameleon that is a Lizard Lizard reptile
4 a cute tabby cat cat feline
5 the fish guppy fish fish
6 Lizard Gecko Lizard reptile
7 German Shepard dog dog canine
8 Budgie Bird Bird avian
9 Canary Bird in a coal mine Bird avian
10 a chihuahua dog dog canine
Data:
Pets = structure(list(Pets = c("one calico cat that's smart", "German Shepard dog",
"A Chameleon that is a Lizard", "a cute tabby cat", "the fish guppy",
"Lizard Gecko", "German Shepard dog", "Budgie Bird", "Canary Bird in a coal mine",
"a chihuahua dog")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame")
Lookup = structure(list(key = c("dog", "cat", "lizard", "bird", "fish"),
classification = c("canine", "feline", "reptile", "avian",
"fish")), class = "data.frame", .Names = c("key", "classification"
), row.names = c(NA, -5L))
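The same split-into-words-and-look-up idea can also be sketched without extra packages, using a named character vector in place of the hash table. This is an assumption-laden alternative, not the hashmap API: the names lookup, pets and cls are illustrative, and the sketch keeps only the first key word found in each string.

```r
# Named vector used as a simple lookup table (names are the keys).
lookup <- c(dog = "canine", cat = "feline", lizard = "reptile",
            bird = "avian", fish = "fish")
pets <- c("one calico cat that's smart", "Lizard Gecko", "the fish guppy")

# Lower-case each string and split it into words on single spaces.
words <- strsplit(tolower(pets), " ", fixed = TRUE)

# For each pet, keep the classification of the first word that is a key.
cls <- vapply(words, function(w) {
  hit <- lookup[w]            # NA for words that are not keys
  hit <- hit[!is.na(hit)]
  if (length(hit)) unname(hit[1]) else NA_character_
}, character(1))
cls
```

Like the hashmap version, this only matches whole words, so a key embedded in a longer word would not be found.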