Question

For two example data frames:

df1 <- structure(list(name = c("Katie", "Eve", "James", "Alexander", 
"Mary", "Barrie", "Harry", "Sam"), postcode = c("CB12FR", "CB12FR", 
"NE34TR", "DH34RL", "PE46YH", "IL57DS", "IP43WR", "IL45TR")), .Names = c("name", 
"postcode"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-8L), spec = structure(list(cols = structure(list(name = structure(list(), class = c("collector_character", 
"collector")), postcode = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("name", "postcode")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

df2 <-structure(list(name = c("Katie", "James", "Alexander", "Lucie", 
"Mary", "Barrie", "Claire", "Harry", "Clare", "Hannah", "Rob", 
"Eve", "Sarah"), postcode = c("CB12FR", "NE34TR", "DH34RL", "DL56TH", 
"PE46YH", "IL57DS", "RE35TP", "IP43WQ", "BH35OP", "CB12FR", "DL56TH", 
"CB12FR", "IL45TR"), rating = c(1L, 1L, 1L, 2L, 3L, 1L, 4L, 2L, 
2L, 3L, 1L, 4L, 2L)), .Names = c("name", "postcode", "rating"
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-13L), spec = structure(list(cols = structure(list(name = structure(list(), class = c("collector_character", 
"collector")), postcode = structure(list(), class = c("collector_character", 
"collector")), rating = structure(list(), class = c("collector_integer", 
"collector"))), .Names = c("name", "postcode", "rating")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

I want to merge the two data frames so that the ratings in df2 are added to df1. I would normally use:

ratings.df <- merge(x = df1, y = df2, by = "postcode", all.x = TRUE)

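However, I want to merge only when:
1. The postcode in df2 is unique (i.e. if the same postcode appears more than once, whether under the same name or different names, those rows are not merged).
2. The first three letters of the name are the same in both data frames.

(I am happy for postcodes with no rating to be left blank; I can fill these in manually.)

Is this possible?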

Answer 1

Why not use the sqldf package? It lets you merge data.frames in R by writing SQL JOIN statements.

As for the conditional merge, that can be done with a CASE statement in SQL.

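So, for your first condition, you could use CASE together with GROUP BY name where COUNT(postcode) = 1, so that for each name that has exactly one postcode assigned, you can perform the JOIN.

Another option is to do the JOIN using gather.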
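As a rough illustration of the sqldf idea, here is a minimal sketch. It assumes the sqldf package is installed with its default SQLite backend, and it uses GROUP BY postcode with HAVING COUNT(*) = 1 instead of a CASE expression, since that is a direct way to keep only postcodes that occur once in df2. The aliases a and b and the plain data.frame versions of df1 and df2 are only for illustration.

```r
library(sqldf)

# Plain data.frame copies of the example data from the question
df1 <- data.frame(
  name = c("Katie", "Eve", "James", "Alexander", "Mary", "Barrie", "Harry", "Sam"),
  postcode = c("CB12FR", "CB12FR", "NE34TR", "DH34RL", "PE46YH", "IL57DS", "IP43WR", "IL45TR"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("Katie", "James", "Alexander", "Lucie", "Mary", "Barrie", "Claire",
           "Harry", "Clare", "Hannah", "Rob", "Eve", "Sarah"),
  postcode = c("CB12FR", "NE34TR", "DH34RL", "DL56TH", "PE46YH", "IL57DS", "RE35TP",
               "IP43WQ", "BH35OP", "CB12FR", "DL56TH", "CB12FR", "IL45TR"),
  rating = c(1L, 1L, 1L, 2L, 3L, 1L, 4L, 2L, 2L, 3L, 1L, 4L, 2L),
  stringsAsFactors = FALSE
)

ratings.df <- sqldf("
  SELECT a.name, a.postcode, b.rating
  FROM df1 AS a
  LEFT JOIN (
    -- keep only postcodes that occur exactly once in df2;
    -- each group then has a single row, so MIN() just returns that row's value
    SELECT postcode, MIN(name) AS name, MIN(rating) AS rating
    FROM df2
    GROUP BY postcode
    HAVING COUNT(*) = 1
  ) AS b
    ON a.postcode = b.postcode
   AND substr(a.name, 1, 3) = substr(b.name, 1, 3)
")
```

Because this is a LEFT JOIN, every row of df1 is kept and rows without a qualifying match get NA for rating. With the example data, only James, Alexander, Mary and Barrie should receive a rating.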

Answer 2

With a dplyr solution, we can first eliminate the duplicates in df2$postcode and then join the resulting data frame to df1:

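library(dplyr)

# Keep the first row for each postcode in df2
df3 <- df2 %>%
  distinct(postcode, .keep_all = TRUE)

df1 %>%
  left_join(df3, by = c("postcode")) %>%
  # Keep only rows where the first three letters of both names agree
  filter(substr(name.x, 1, 3) == substr(name.y, 1, 3)) %>%
  rename(name = name.x) %>%
  # Drop the name column that came from df2
  mutate(name.y = NULL)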

This yields

# A tibble: 5 x 3
  name      postcode rating
  <chr>     <chr>     <int>
1 Katie     CB12FR        1
2 James     NE34TR        1
3 Alexander DH34RL        1
4 Mary      PE46YH        3
5 Barrie    IL57DS        1

Is this what you were trying to achieve?
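Note that distinct() keeps the first row of each duplicated postcode instead of dropping the postcode entirely, and the filter() step removes df1 rows that have no match. If the stricter reading of the question is what is wanted (skip any postcode that occurs more than once in df2, and leave unmatched rows blank), a variant could look like the sketch below. The helper column key and the names df2_unique and result are made up for this example, and the data is re-entered as plain data frames.

```r
library(dplyr)

# Example data from the question, re-entered as plain data frames
df1 <- data.frame(
  name = c("Katie", "Eve", "James", "Alexander", "Mary", "Barrie", "Harry", "Sam"),
  postcode = c("CB12FR", "CB12FR", "NE34TR", "DH34RL", "PE46YH", "IL57DS", "IP43WR", "IL45TR"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("Katie", "James", "Alexander", "Lucie", "Mary", "Barrie", "Claire",
           "Harry", "Clare", "Hannah", "Rob", "Eve", "Sarah"),
  postcode = c("CB12FR", "NE34TR", "DH34RL", "DL56TH", "PE46YH", "IL57DS", "RE35TP",
               "IP43WQ", "BH35OP", "CB12FR", "DL56TH", "CB12FR", "IL45TR"),
  rating = c(1L, 1L, 1L, 2L, 3L, 1L, 4L, 2L, 2L, 3L, 1L, 4L, 2L),
  stringsAsFactors = FALSE
)

# Keep only postcodes that occur exactly once in df2,
# and build a join key from the first three letters of the name
df2_unique <- df2 %>%
  group_by(postcode) %>%
  filter(n() == 1) %>%
  ungroup() %>%
  mutate(key = substr(name, 1, 3)) %>%
  select(postcode, key, rating)

# Left join on postcode AND name prefix, so unmatched df1 rows stay with NA
result <- df1 %>%
  mutate(key = substr(name, 1, 3)) %>%
  left_join(df2_unique, by = c("postcode", "key")) %>%
  select(-key)
```

Here result keeps all eight rows of df1 in their original order. James, Alexander, Mary and Barrie get ratings 1, 1, 3 and 1, while Katie, Eve, Harry and Sam are left as NA to be filled in by hand.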
