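I have a data frame, df, with close to 100,000 rows that lists the contacts for my programs. It has columns for the program (Program) and the organization (OrgName) associated with each contact. It also has a set of three columns showing whether the contact holds the role named in the column header: Role_Primary, Role_Comms, Role_Signatory. Whenever a contact is in more than one program, or has more than one role within a program, another row is created for that contact in which the program and role field values change. See the example below.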
First  Last     C_ID   OrgName  O_ID  Program    Role_Primary  Role_Comms  Role_Signatory
John   Smith    10045  Acme     901   Buildings  X
John   Smith    10045  Acme     901   Buildings                X
John   Smith    10045  Acme     901   Homes      X
Teddy  Bush     10046  Acme     901   Buildings  X
Teddy  Bush     10046  Acme     901   Buildings                            X
Jess   Clinton  10050  Consult  904   Homes                                X
Jess   Clinton  10050  Consult  904   Homes      X
Jess   Clinton  10050  Consult  904   Homes                    X
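For presentation purposes, I am trying to minimize the number of rows. Specifically, if a contact is in the same organization and the same program, I want that contact to appear on a single row (instead of several, as now), with each of the contact's roles marked in the relevant column. See below.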
First  Last     C_ID   OrgName  O_ID  Program    Role_Primary  Role_Comms  Role_Signatory
John   Smith    10045  Acme     901   Buildings  X             X
John   Smith    10045  Acme     901   Homes      X
Teddy  Bush     10046  Acme     901   Buildings  X                         X
Jess   Clinton  10050  Consult  904   Homes      X             X           X
To recreate the two tables above:
table1<-structure(list(First = structure(c(2L, 2L, 2L, 3L, 3L, 1L, 1L,
1L), .Label = c("Jess", "John", "Teddy"), class = "factor"),
Last = structure(c(3L, 3L, 3L, 1L, 1L, 2L, 2L, 2L), .Label = c("Bush",
"Clinton", "Smith"), class = "factor"), C_ID = c(10045L,
10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L),
OrgName = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Acme",
"Consult"), class = "factor"), O_ID = c(901L, 901L, 901L,
901L, 901L, 904L, 904L, 904L), Program = structure(c(1L,
1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Buildings", "Homes"
), class = "factor"), Role_Primary = structure(c(2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L), .Label = c("", "X"), class = "factor"),
Role_Comms = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("",
"X"), class = "factor"), Role_Signatory = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("", "X"), class = "factor")), .Names = c("First",
"Last", "C_ID", "OrgName", "O_ID", "Program", "Role_Primary",
"Role_Comms", "Role_Signatory"), class = "data.frame", row.names = c(NA,
-8L))
table2<-structure(list(First = structure(c(2L, 2L, 3L, 1L), .Label = c("Jess",
"John", "Teddy"), class = "factor"), Last = structure(c(3L, 3L,
1L, 2L), .Label = c("Bush", "Clinton", "Smith"), class = "factor"),
C_ID = c(10045L, 10045L, 10046L, 10050L), OrgName = structure(c(1L,
1L, 1L, 2L), .Label = c("Acme", "Consult"), class = "factor"),
O_ID = c(901L, 901L, 901L, 904L), Program = structure(c(1L,
2L, 1L, 2L), .Label = c("Buildings", "Homes"), class = "factor"),
Role_Primary = structure(c(1L, 1L, 1L, 1L), .Label = "X", class = "factor"),
Role_Comms = structure(c(2L, 1L, 1L, 2L), .Label = c("",
"X"), class = "factor"), Role_Signatory = structure(c(1L,
1L, 2L, 2L), .Label = c("", "X"), class = "factor")), .Names = c("First",
"Last", "C_ID", "OrgName", "O_ID", "Program", "Role_Primary",
"Role_Comms", "Role_Signatory"), class = "data.frame", row.names = c(NA,
-4L))
Answer 0 (score: 0)
Input
df <- data.table::fread("First Last C_ID OrgName O_ID Program Role_Primary Role_Comms Role_Signatory
John Smith 10045 Acme 901 Buildings X
John Smith 10045 Acme 901 Buildings X
John Smith 10045 Acme 901 Homes X
Teddy Bush 10046 Acme 901 Buildings X
Teddy Bush 10046 Acme 901 Buildings X
Jess Clinton 10050 Consult 904 Homes X
Jess Clinton 10050 Consult 904 Homes X
Jess Clinton 10050 Consult 904 Homes X ")
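Code to combine the rows:
library(dplyr)
library(tidyr)
df %>%
  # stack the three role columns into Role/Member pairs
  gather(Role, Member, Role_Primary:Role_Signatory) %>%
  # keep only the roles that are actually marked
  filter(!is.na(Member) & nchar(trimws(Member)) > 0) %>%
  # drop repeated contact/program/role combinations
  distinct() %>%
  # fix the factor levels so the role columns keep their original order
  mutate(Role = factor(Role, unique(Role))) %>%
  # one row per contact, organization and program; one column per role
  spread(Role, Member)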
Output
   First  Last     C_ID   OrgName  O_ID  Program    Role_Primary  Role_Comms  Role_Signatory
1  Jess   Clinton  10050  Consult  904   Homes      X             <NA>        X
2  John   Smith    10045  Acme     901   Buildings  X             X           <NA>
3  John   Smith    10045  Acme     901   Homes      X             <NA>        <NA>
4  Teddy  Bush     10046  Acme     901   Buildings  X             <NA>        X
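Note that the distinct() line is there because, in the input example, Jess Clinton has the same role listed twice. Without it, spread() would stop with an error about duplicate identifiers for those rows.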
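In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(), so the same reshaping could also be sketched as below. This is an alternative under stated assumptions, not the answerer's code: the sample data is rebuilt inline from the question's table1 (with blank cells assumed to be empty strings), the name result is introduced for illustration, and values_fill is used so that missing roles come out as blanks rather than NA.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the question's table1 (assumption: blanks are "")
df <- data.frame(
  First   = c("John", "John", "John", "Teddy", "Teddy", "Jess", "Jess", "Jess"),
  Last    = c("Smith", "Smith", "Smith", "Bush", "Bush", "Clinton", "Clinton", "Clinton"),
  C_ID    = c(10045L, 10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L),
  OrgName = c("Acme", "Acme", "Acme", "Acme", "Acme", "Consult", "Consult", "Consult"),
  O_ID    = c(901L, 901L, 901L, 901L, 901L, 904L, 904L, 904L),
  Program = c("Buildings", "Buildings", "Homes", "Buildings", "Buildings",
              "Homes", "Homes", "Homes"),
  Role_Primary   = c("X", "",  "X", "X", "",  "",  "X", ""),
  Role_Comms     = c("",  "X", "",  "",  "",  "",  "",  "X"),
  Role_Signatory = c("",  "",  "",  "",  "X", "X", "",  ""),
  stringsAsFactors = FALSE
)

result <- df %>%
  # stack the role columns into one Role/Member pair per row
  pivot_longer(Role_Primary:Role_Signatory,
               names_to = "Role", values_to = "Member") %>%
  # keep only the roles that are actually marked
  filter(!is.na(Member), nchar(trimws(Member)) > 0) %>%
  # guard against the same role being listed twice for a contact
  distinct() %>%
  # back to one column per role; unmarked roles become ""
  pivot_wider(names_from = Role, values_from = Member,
              values_fill = list(Member = ""))

print(result)
```

With this sample data the result has one row per contact, organization and program, which is the shape shown in the question's second table.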
Answer 1 (score: 0)
My approach is to create a unique identifier column, use it to keep only the unique rows (saved as df2), and then fill in the missing values in df2 using the full data in df.
library(dplyr)
# make a uniqueID column by pasting the relevant identifying values together
df$uniqueID <- paste0(df$C_ID, df$OrgName, df$Program)
# remove duplicate rows, store as df2
df2 <- df[!duplicated(df$uniqueID), ]
Then, using each uniqueID in df2 as the lookup key, use any() to check whether df contains an X for that same uniqueID in each role column.
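One way this lookup could be written is sketched below. It is a hypothetical reconstruction rather than the answerer's original code: it assumes blank role cells are stored as "" or NA, rebuilds the question's sample data inline so the block runs on its own, and introduces the names role_cols and has_role purely for illustration.

```r
# Sample data rebuilt from the question's table1 (assumption: blanks are "")
df <- data.frame(
  First   = c("John", "John", "John", "Teddy", "Teddy", "Jess", "Jess", "Jess"),
  Last    = c("Smith", "Smith", "Smith", "Bush", "Bush", "Clinton", "Clinton", "Clinton"),
  C_ID    = c(10045L, 10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L),
  OrgName = c("Acme", "Acme", "Acme", "Acme", "Acme", "Consult", "Consult", "Consult"),
  O_ID    = c(901L, 901L, 901L, 901L, 901L, 904L, 904L, 904L),
  Program = c("Buildings", "Buildings", "Homes", "Buildings", "Buildings",
              "Homes", "Homes", "Homes"),
  Role_Primary   = c("X", "",  "X", "X", "",  "",  "X", ""),
  Role_Comms     = c("",  "X", "",  "",  "",  "",  "",  "X"),
  Role_Signatory = c("",  "",  "",  "",  "X", "X", "",  ""),
  stringsAsFactors = FALSE
)

# Same two steps as in the answer: build the key, keep the first row per key
df$uniqueID <- paste0(df$C_ID, df$OrgName, df$Program)
df2 <- df[!duplicated(df$uniqueID), ]

# Hypothetical fill step: for every key kept in df2, mark a role with "X"
# if any row of df with the same key has an "X" in that role column
role_cols <- c("Role_Primary", "Role_Comms", "Role_Signatory")
for (col in role_cols) {
  has_role <- vapply(
    df2$uniqueID,
    function(id) any(df[[col]][df$uniqueID == id] == "X", na.rm = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
  df2[[col]] <- ifelse(has_role, "X", "")
}

# Drop the helper column so df2 has the same columns as the desired table
df2$uniqueID <- NULL
```

Looping over role_cols keeps the sketch short; for a data frame with close to 100,000 rows, a grouped summary would likely be faster than one lookup per key.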
This will give you the desired output:
df2