I have two data frames like these:
DF1
name <- c("Ted","Bill","James","Randy","Mark","Jimmy","Eric","Allen")
team <- c("Hawks","Tigers","Bears","Tigers","Lions","Bears","Hawks","Lions")
df1 <- data.frame(name,team)
DF2
name <- c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark")
team <- c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions")
game_id <- c("21","23","28","21","21","21","29","22","22","32","42")
df2 <- data.frame(name,team,game_id)
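I want to mark a game_id in df2 as NA if that game_id does not include every name that df1 lists for the respective team. For example, in the sample data I've provided, game_id 32 in the row with "James" and "Bears" would be one of the game_ids marked as NA, because "Jimmy" is not represented in df2 for game_id 32. We know Jimmy must be represented because he appears in a row of df1 with "Bears" given as his team.
The desired output for my sample data looks like this: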
DF3
name <- c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark")
team <- c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions")
game_id <- c("21",NA,NA,"21","21","21",NA,"22","22",NA,NA)
df3 <- data.frame(name,team,game_id)
I think the solution starts with spreading df1 (after adding a unique ID column), like so:
df1$row_index <- seq.int(nrow(df1))
df1 <- spread(df1,team,name)
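But after that I'm stuck. What's the best way to do this?
Answer 0 (score: 2)
You should be able to do this with an "anti-join" against all of the correct team/name combinations: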
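library(dplyr)
# (game_id, team) pairs where at least one player listed in df1 is missing from df2
badgames <- df1 %>%
  full_join(distinct(select(df2, game_id, team)), by = "team") %>%
  anti_join(df2, by = c("team", "game_id", "name")) %>%
  distinct(game_id, team) %>%  # one row per pair, so the join below cannot duplicate rows
  mutate(hit = 1)
# Set game_id to NA for every row of df2 that belongs to a flagged pair
df2 %>%
  left_join(badgames, by = c("game_id", "team")) %>%
  mutate(game_id = replace(game_id, hit == 1, NA), hit = NULL)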
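The same logic works with data.table joins, where you can specify an anti-join by putting a ! in front of the joined table. You can also use := to do all of the updating in a single step, rather than creating an intermediate dataset: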
library(data.table)
setDT(df1)
setDT(df2)
df2[
df1[unique(df2[, .(game_id,team)]), on=.(team)][
!df2, on=.(game_id, team, name)], on=.(game_id,team),
game_id := NA
]
Both result in:
# name team game_id
#1 Ted Hawks 21
#2 Bill Tigers <NA>
#3 Mark Lions <NA>
#4 Jimmy Bears 21
#5 Eric Hawks 21
#6 James Bears 21
#7 Allen Lions <NA>
#8 Randy Tigers 22
#9 Bill Tigers 22
#10 James Bears <NA>
#11 Mark Lions <NA>
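To see why particular rows get flagged, it can help to look at the intermediate result of the anti-join, i.e. which roster entries are missing from which game. The following is a minimal base R sketch of that idea, not the answer's own code; the names `expected`, `key` and `missing_players` are made up for illustration, and the "|" separator assumes no name, team or game_id contains that character.

```r
df1 <- data.frame(
  name = c("Ted","Bill","James","Randy","Mark","Jimmy","Eric","Allen"),
  team = c("Hawks","Tigers","Bears","Tigers","Lions","Bears","Hawks","Lions"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark"),
  team = c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions"),
  game_id = c("21","23","28","21","21","21","29","22","22","32","42"),
  stringsAsFactors = FALSE
)

# Every (team, game_id) pair seen in df2, expanded to the team's full roster from df1
expected <- merge(unique(df2[c("team", "game_id")]), df1, by = "team")

# Hypothetical helper: one string key per (team, game_id, name) combination
key <- function(d) paste(d$team, d$game_id, d$name, sep = "|")

# Roster entries that never appear in df2: the players missing from each game
missing_players <- expected[!key(expected) %in% key(df2), ]
missing_players <- missing_players[order(missing_players$team, missing_players$game_id), ]
missing_players
```

For the sample data this should list five missing roster entries, one of which is Jimmy for game_id 32 of the Bears, the case described in the question.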
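Answer 1 (score: 1)
Here's another way, using counts. We compare the number of players per team in df1 with the number of players per team per game in df2. This could get tripped up if df1 is an incomplete list of players, e.g. if the Lions had two players in df1 and two entirely different players played for them in a game in df2, but if I understand correctly that shouldn't happen in this setup.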
library(tidyverse)
df1 <- tibble(
name = c("Ted","Bill","James","Randy","Mark","Jimmy","Eric","Allen"),
team = c("Hawks","Tigers","Bears","Tigers","Lions","Bears","Hawks","Lions")
)
df2 <- tibble(
name = c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark"),
team = c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions"),
game_id = c("21","23","28","21","21","21","29","22","22","32","42")
)
df2 %>%
add_count(team, game_id) %>%
left_join(add_count(df1, team), by = c("name", "team")) %>%
mutate(game_id = ifelse(n.x == n.y, game_id, NA)) %>%
select(name:game_id)
#> # A tibble: 11 x 3
#> name team game_id
#> <chr> <chr> <chr>
#> 1 Ted Hawks 21
#> 2 Bill Tigers <NA>
#> 3 Mark Lions <NA>
#> 4 Jimmy Bears 21
#> 5 Eric Hawks 21
#> 6 James Bears 21
#> 7 Allen Lions <NA>
#> 8 Randy Tigers 22
#> 9 Bill Tigers 22
#> 10 James Bears <NA>
#> 11 Mark Lions <NA>
Created on 2018-04-10 by the reprex package (v0.2.0).
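The caveat about incomplete rosters can be made concrete with a small sketch. The data below is invented for illustration: game "50" and the player "Zed" do not appear in the question. Because this code joins on name as well as team, rows for players who are absent from df1 end up as NA anyway, so the sketch uses a mixed case, with one listed and one unlisted player, where the count comparison alone seems to be fooled.

```r
library(dplyr)

# Roster: the Lions have exactly two players
df1 <- data.frame(name = c("Mark", "Allen"), team = c("Lions", "Lions"),
                  stringsAsFactors = FALSE)
# Hypothetical game "50": Mark plays next to "Zed", who is not in df1; Allen is absent
df2 <- data.frame(name = c("Mark", "Zed"), team = c("Lions", "Lions"),
                  game_id = c("50", "50"), stringsAsFactors = FALSE)

# Same count comparison as above
by_count <- df2 %>%
  add_count(team, game_id) %>%
  left_join(add_count(df1, team), by = c("name", "team")) %>%
  mutate(game_id = ifelse(n.x == n.y, game_id, NA)) %>%
  select(name:game_id)

# Direct roster check for comparison: is every listed Lions player in game "50"?
roster_complete <- all(
  df1$name[df1$team == "Lions"] %in% df2$name[df2$game_id == "50"]
)

by_count
roster_complete
```

Here Mark's row should keep game_id "50" even though Allen is missing, because two players took part and the roster has two players. If df1 is known to be complete and df2 has no duplicated rows, the count comparison and the roster check should agree.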
Answer 2 (score: 0)
Here's one way:
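Answer 3 (score: 0)
Solutions that count players may not accurately capture the scenario you're looking at when the number of players is the same but different players are playing.
So if you want to know exactly which players are playing, you may want to concatenate all of the player names in sorted order and compare the results.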
library(dplyr)
name <- c("Ted","Bill","James","Randy","Mark","Jimmy","Eric","Allen")
team <- c("Hawks","Tigers","Bears","Tigers","Lions","Bears","Hawks","Lions")
df1 <- data.frame(name, team, stringsAsFactors = FALSE)
name <- c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark")
team <- c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions")
game_id <- c("21","23","28","21","21","21","29","22","22","32","42")
# Note the game_id needs to be a string, otherwise the NAs may be improperly captured
df2 <- data.frame(name,team,game_id, stringsAsFactors = FALSE)
# Concatenate all players names by group in df1
df1.all.members <- df1 %>%
group_by(team) %>%
arrange(name) %>%
summarise(all_players = paste0(name, collapse = "_"))
# Perform the same concatenation in df2
df2.all.members <- df2 %>%
group_by(team, game_id) %>%
arrange(name) %>%
mutate(all_players2 = paste0(name, collapse = "_")) %>%
# Left join with the new df1
left_join(df1.all.members, by = "team") %>%
ungroup %>%
# Compare if all names are the same
mutate(game_id = ifelse(all_players2 == all_players, game_id, NA)) %>%
# Select required fields
select(name, team, game_id)
# # A tibble: 11 x 3
# name team game_id
# <chr> <chr> <chr>
# 1 Allen Lions <NA>
# 2 Bill Tigers <NA>
# 3 Bill Tigers 22
# 4 Eric Hawks 21
# 5 James Bears 21
# 6 James Bears <NA>
# 7 Jimmy Bears 21
# 8 Mark Lions <NA>
# 9 Mark Lions <NA>
# 10 Randy Tigers 22
# 11 Ted Hawks 21
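A variation on the same idea is to compare the names as sets instead of as concatenated strings. This avoids any ambiguity if a name itself contained the "_" separator, and it leaves df2 in its original row order. The following is a rough base R sketch under those assumptions; `roster`, `group` and `complete` are illustrative names, and the "|" used to build the group key is assumed not to occur in team or game_id.

```r
df1 <- data.frame(
  name = c("Ted","Bill","James","Randy","Mark","Jimmy","Eric","Allen"),
  team = c("Hawks","Tigers","Bears","Tigers","Lions","Bears","Hawks","Lions"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark"),
  team = c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions"),
  game_id = c("21","23","28","21","21","21","29","22","22","32","42"),
  stringsAsFactors = FALSE
)

# Full roster per team, as a named list of character vectors
roster <- split(df1$name, df1$team)

# One key per (team, game_id) group in df2
group <- paste(df2$team, df2$game_id, sep = "|")

# For each group, check whether the names present equal the team's roster as a set
complete <- vapply(
  split(seq_len(nrow(df2)), group),
  function(rows) setequal(df2$name[rows], roster[[df2$team[rows[1]]]]),
  logical(1)
)

# Keep df2's row order and set game_id to NA where the group is incomplete
df3 <- df2
df3$game_id[!complete[group]] <- NA
df3
```

On the sample data this should give the same game_id column as the desired df3 in the question, in the original row order.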
Answer 4 (score: 0)
With sqldf, you can skip the annoying NA replacement.
library(dplyr)
library(sqldf)
dfx <- inner_join(count(df2,game_id,team),count(df1,team))
sqldf("SELECT name, team, dfx.game_id from df2 natural left join dfx")
# or finish the dplyr chain with:
# %>% right_join(df2) %>% mutate(game_id = `is.na<-`(game_id,is.na(n))) %>% select(-n)
# name team game_id
# 1 Ted Hawks 21
# 2 Bill Tigers <NA>
# 3 Mark Lions <NA>
# 4 Jimmy Bears 21
# 5 Eric Hawks 21
# 6 James Bears 21
# 7 Allen Lions <NA>
# 8 Randy Tigers 22
# 9 Bill Tigers 22
# 10 James Bears <NA>
# 11 Mark Lions <NA>
data.table can do this too:
setDT(df1)
setDT(df2)
dfx <- df2[,.N, by=c("team","game_id")][df1[,.N, by=team],on=c("team","N")]
dfx[df2,.(name,team,game_id=x.game_id),on=c("team","game_id")]
# name team game_id
# 1: Ted Hawks 21
# 2: Bill Tigers NA
# 3: Mark Lions NA
# 4: Jimmy Bears 21
# 5: Eric Hawks 21
# 6: James Bears 21
# 7: Allen Lions NA
# 8: Randy Tigers 22
# 9: Bill Tigers 22
# 10: James Bears NA
# 11: Mark Lions NA
A base R version for completeness. Note that tables can be merged without first converting them to a data.frame:
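# Assumes df1 and df2 are plain data.frames (not converted with setDT as above)
# merge() coerces each table with as.data.frame(), so the count column is named Freq
dfx <- merge(table(df2[-1]), table(df1[-1], dnn = names(df1[-1])))
df3 <- merge(df2, dfx, all.x = TRUE)
is.na(df3$game_id) <- is.na(df3$Freq)
df3 <- df3[-4]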
# team game_id name
# 1 Bears 21 Jimmy
# 2 Bears 21 James
# 3 Bears <NA> James
# 4 Hawks 21 Ted
# 5 Hawks 21 Eric
# 6 Lions <NA> Mark
# 7 Lions <NA> Allen
# 8 Lions <NA> Mark
# 9 Tigers 22 Randy
# 10 Tigers 22 Bill
# 11 Tigers <NA> Bill