Matching observations from one table against a variable made up of strings in another table

Time: 2016-09-20 00:13:41

Tags: r stringr

I have two datasets named A and B.

 library(data.table)
 Farm.Type <- c("Fruits","Vegetables","Livestock")
 Produce.All <- c("Apple, Orange, Pears, Strawberries","Broccoli, Cabbage, Spinach","Cow, Pig, Chicken")

 Store <- c("Convenience","Wholesale","Grocery","Market")
 Produce <- c("Oranges","Watermelon","Cabbage","Pig")
 Farm <- c("Fruits","","Vegetables","Livestock")

 A <- data.table(Farm.Type, Produce.All)
 B <- data.table(Store, Produce)

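I am trying to determine which Farm.Type in table A each Produce value in table B belongs to, without changing the format of either table, so that I can pull the Farm.Type field into table B. The resulting data frame would look like this: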

 C <- data.table(Store, Produce, Farm)

I tried using %in% as follows:

 B$Farm[B$Produce %in% A$Produce.All] <- A$Farm.Type

But because the A$Produce.All field is a single string of comma-separated items, nothing matches.
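To illustrate the failure: `%in%` compares whole elements for exact equality and never looks inside a string. A minimal base R sketch using the vectors from the question (the names `whole_match`, `items`, and `item_match` are only for illustration):

```r
# %in% tests whole-element equality, so a single item never equals
# a full comma-separated list.
Produce.All <- c("Apple, Orange, Pears, Strawberries",
                 "Broccoli, Cabbage, Spinach",
                 "Cow, Pig, Chicken")
Produce <- c("Oranges", "Watermelon", "Cabbage", "Pig")

whole_match <- Produce %in% Produce.All
print(whole_match)

# After splitting each list on ", ", individual items can be compared.
items <- unlist(strsplit(Produce.All, ", ", fixed = TRUE))
item_match <- Produce %in% items
print(item_match)
```

Even after splitting, "Oranges" is not found, because table A spells it "Orange".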

Is there a way to search the strings in A$Produce.All for matches of B$Produce?

Any help is appreciated.

Thanks.

2 answers:

Answer 0: (score: 2)

Farm.Type <- c("Fruits","Vegetables","Livestock")
Produce.All <- c("Apple, Oranges, Pears, Strawberries","Broccoli, Cabbage, Spinach","Cow, Pig, Chicken")

Store <- c("Convenience","Wholesale","Grocery","Market")
Produce <- c("Orange","Watermelon","Cabbage","Pig")
Farm <- c("Fruits","","Vegetables","Livestock")

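There is no need for data.table here, so I'll forgo it. You are much better off reshaping the data, because otherwise you have to go through contortions like this: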

library(dplyr)
library(purrr)
library(stringi)

# tibble() replaces the deprecated data_frame()
A <- tibble(Farm.Type, Produce.All)
B <- tibble(Store, Produce)

map(B$Produce, ~stri_detect_regex(A$Produce.All, sprintf("[[:space:],]*%s[[:space:],]*", .))) %>% 
  map(which) %>% 
  map_chr(~A$Farm.Type[ifelse(length(.)==0, NA, .)][1]) 

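(You would still need to add that result to the B data frame.) The alternative is to reshape A into long form, one row per item, and join: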

library(purrr)
library(dplyr)
library(tidyr)
library(stringi)  # provides stri_split_regex()

mutate(A, Produce.All=stri_split_regex(Produce.All, ", ")) %>% 
  unnest(Produce.All) -> A_long

left_join(B, A_long, by=c("Produce"="Produce.All"))

Also, I sure hope this isn't homework.

Answer 1: (score: 2)

Pretty much a repeat of hrbrmstr's answer, but sticking with data.table and some base R:

longA <- 
  stack(
    setNames(
      strsplit(A[, Produce.All], ", "),
      A[, Farm.Type]
    )
  )

merge(longA, B, by.x = "values", by.y = "Produce", all.y = TRUE)
#      values        ind       Store
#1    Cabbage Vegetables     Grocery
#2    Oranges       <NA> Convenience
#3        Pig  Livestock      Market
#4 Watermelon       <NA>   Wholesale

# Or using a data.table merge, if you like
setDT(longA)[B, on = c(values = "Produce")]
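If B's original row order matters and you want the empty-string placeholder shown in C, a lookup with `match()` is one option. This is only a sketch in base R using the question's data; the names `items` and `lookup` are illustrative, not part of either answer:

```r
Farm.Type <- c("Fruits", "Vegetables", "Livestock")
Produce.All <- c("Apple, Orange, Pears, Strawberries",
                 "Broccoli, Cabbage, Spinach",
                 "Cow, Pig, Chicken")
B <- data.frame(
  Store = c("Convenience", "Wholesale", "Grocery", "Market"),
  Produce = c("Oranges", "Watermelon", "Cabbage", "Pig"),
  stringsAsFactors = FALSE
)

# One vector of items per farm type, flattened into an item -> farm lookup.
items <- strsplit(Produce.All, ", ", fixed = TRUE)
lookup <- data.frame(
  item = unlist(items),
  farm = rep(Farm.Type, lengths(items)),
  stringsAsFactors = FALSE
)

# match() gives the lookup row for each B$Produce (NA when absent),
# so B keeps its original row order, unlike merge().
B$Farm <- lookup$farm[match(B$Produce, lookup$item)]
B$Farm[is.na(B$Farm)] <- ""  # empty string for unmatched items, as in C
print(B)
```

With the data exactly as posted in the question, "Oranges" still gets an empty Farm value.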

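Of course, "Orange" does not match "Oranges", and the inconsistent use of plural and singular forms across the two datasets makes the merge more challenging. If that also needs to be handled, I would suggest mapping the plural forms to singular before doing the merge.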
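One way to sketch that normalization step, assuming a small hand-maintained map is acceptable: `singular_of` and `to_singular()` below are hypothetical helpers written for this example, not functions from any package, and the map would need an entry for every plural form in the real data.

```r
# Hypothetical, hand-maintained plural -> singular map; extend as needed.
singular_of <- c(Oranges = "Orange", Pears = "Pear",
                 Strawberries = "Strawberry")

# Replace any value found in the map and leave everything else untouched.
to_singular <- function(x) {
  hit <- x %in% names(singular_of)
  x[hit] <- unname(singular_of[x[hit]])
  x
}

Farm.Type <- c("Fruits", "Vegetables", "Livestock")
Produce.All <- c("Apple, Orange, Pears, Strawberries",
                 "Broccoli, Cabbage, Spinach",
                 "Cow, Pig, Chicken")
Produce <- c("Oranges", "Watermelon", "Cabbage", "Pig")

# Normalize both sides the same way before looking anything up.
items <- strsplit(Produce.All, ", ", fixed = TRUE)
lookup_item <- to_singular(unlist(items))
lookup_farm <- rep(Farm.Type, lengths(items))

Farm <- lookup_farm[match(to_singular(Produce), lookup_item)]
print(Farm)
```

Applying the same function to both tables is what matters here: "Oranges" in B and "Orange" in A both become "Orange", so the lookup finds "Fruits", while "Watermelon" remains NA because it appears in no list.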