Question

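I have a data frame df with close to 100,000 rows that holds the contact list for my programs. The list has a column for the program (Program) and one for the organization (OrgName) that each contact is associated with. It also has a set of three columns showing whether the contact holds the role named in the column header: Role_Primary, Role_Comms, Role_Signatory. Whenever a contact is in more than one program, or has more than one role within a program, another row is created for that contact with the program and role fields changed. See the example below.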

First   Last    C_ID    OrgName O_ID Program    Role_Primary    Role_Comms  Role_Signatory
John    Smith   10045   Acme    901  Buildings  X       
John    Smith   10045   Acme    901  Buildings                  X   
John    Smith   10045   Acme    901  Homes      X       
Teddy   Bush    10046   Acme    901  Buildings  X       
Teddy   Bush    10046   Acme    901  Buildings                              X
Jess    Clinton 10050   Consult 904  Homes                                  X
Jess    Clinton 10050   Consult 904  Homes      X       
Jess    Clinton 10050   Consult 904  Homes      X

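For presentation purposes, I am trying to minimize the number of rows. Specifically, if a contact is in the same organization and the same program, I want the contact to appear in only one row (instead of several, as now), with the contact's roles marked in the relevant columns. See below.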

First   Last    C_ID    OrgName O_ID Program    Role_Primary    Role_Comms  Role_Signatory
John    Smith   10045   Acme    901  Buildings  X               X   
John    Smith   10045   Acme    901  Homes      X       
Teddy   Bush    10046   Acme    901  Buildings  X                           X
Jess    Clinton 10050   Consult 904  Homes      X               X           X

To recreate the two tables above:

 table1<-structure(list(First = structure(c(2L, 2L, 2L, 3L, 3L, 1L, 1L, 
1L), .Label = c("Jess", "John", "Teddy"), class = "factor"), 
    Last = structure(c(3L, 3L, 3L, 1L, 1L, 2L, 2L, 2L), .Label = c("Bush", 
    "Clinton", "Smith"), class = "factor"), C_ID = c(10045L, 
    10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L), 
    OrgName = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Acme", 
    "Consult"), class = "factor"), O_ID = c(901L, 901L, 901L, 
    901L, 901L, 904L, 904L, 904L), Program = structure(c(1L, 
    1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Buildings", "Homes"
    ), class = "factor"), Role_Primary = structure(c(2L, 1L, 
    2L, 2L, 1L, 1L, 2L, 1L), .Label = c("", "X"), class = "factor"), 
    Role_Comms = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("", 
    "X"), class = "factor"), Role_Signatory = structure(c(1L, 
    1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("", "X"), class = "factor")), .Names = c("First", 
"Last", "C_ID", "OrgName", "O_ID", "Program", "Role_Primary", 
"Role_Comms", "Role_Signatory"), class = "data.frame", row.names = c(NA, 
-8L))

table2<-structure(list(First = structure(c(2L, 2L, 3L, 1L), .Label = c("Jess", 
"John", "Teddy"), class = "factor"), Last = structure(c(3L, 3L, 
1L, 2L), .Label = c("Bush", "Clinton", "Smith"), class = "factor"), 
    C_ID = c(10045L, 10045L, 10046L, 10050L), OrgName = structure(c(1L, 
    1L, 1L, 2L), .Label = c("Acme", "Consult"), class = "factor"), 
    O_ID = c(901L, 901L, 901L, 904L), Program = structure(c(1L, 
    2L, 1L, 2L), .Label = c("Buildings", "Homes"), class = "factor"), 
    Role_Primary = structure(c(1L, 1L, 1L, 1L), .Label = "X", class = "factor"), 
    Role_Comms = structure(c(2L, 1L, 1L, 2L), .Label = c("", 
    "X"), class = "factor"), Role_Signatory = structure(c(1L, 
    1L, 2L, 2L), .Label = c("", "X"), class = "factor")), .Names = c("First", 
"Last", "C_ID", "OrgName", "O_ID", "Program", "Role_Primary", 
"Role_Comms", "Role_Signatory"), class = "data.frame", row.names = c(NA, 
-4L))

Answer 1

Input

df <- data.table::fread("First      Last        C_ID        OrgName     O_ID    Program Role_Primary    Role_Comms  Role_Signatory
John        Smith       10045   Acme        901 Buildings   X       
John        Smith       10045   Acme        901 Buildings       X   
John        Smith       10045   Acme        901 Homes       X       
Teddy       Bush        10046   Acme        901 Buildings   X       
Teddy       Bush        10046   Acme        901 Buildings           X
Jess        Clinton     10050   Consult     904 Homes               X
Jess        Clinton     10050   Consult     904 Homes       X       
Jess        Clinton     10050   Consult     904 Homes       X       ")

Code to merge the rows:

library(dplyr)
library(tidyr)

df %>% 
  gather(Role, Member, Role_Primary:Role_Signatory) %>% 
  filter(!is.na(Member) & nchar(trimws(Member))>0) %>% 
  distinct() %>% 
  mutate(Role = factor(Role, unique(Role))) %>% 
  spread(Role, Member)

Output

  First    Last  C_ID OrgName O_ID   Program Role_Primary Role_Comms Role_Signatory
1  Jess Clinton 10050 Consult  904     Homes            X       <NA>              X
2  John   Smith 10045    Acme  901 Buildings            X          X           <NA>
3  John   Smith 10045    Acme  901     Homes            X       <NA>           <NA>
4 Teddy    Bush 10046    Acme  901 Buildings            X       <NA>              X

Note that the distinct() line is there because, in the input example, Jess Clinton has the same role listed twice.
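In current tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same reshape could be sketched with those functions as follows. The small contacts data frame is a hypothetical stand-in for the question's data, with character role columns that hold either "X" or an empty string, and the name collapsed is also just illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's data (not the full table)
contacts <- data.frame(
  First = c("John", "John", "John", "Jess", "Jess", "Jess"),
  C_ID = c(10045L, 10045L, 10045L, 10050L, 10050L, 10050L),
  Program = c("Buildings", "Buildings", "Homes", "Homes", "Homes", "Homes"),
  Role_Primary   = c("X", "", "X", "", "X", "X"),
  Role_Comms     = c("", "X", "", "", "", ""),
  Role_Signatory = c("", "", "", "X", "", ""),
  stringsAsFactors = FALSE
)

collapsed <- contacts %>%
  # one row per contact/program/role combination
  pivot_longer(starts_with("Role_"), names_to = "Role", values_to = "Member") %>%
  # keep only the roles that are actually marked
  filter(!is.na(Member), nchar(trimws(Member)) > 0) %>%
  # drop repeated roles so each cell receives a single value
  distinct() %>%
  # back to one column per role; unmarked roles become NA
  pivot_wider(names_from = Role, values_from = Member)

print(collapsed)
```

As in the gather()/spread() version, a role that is never marked anywhere in the data would not get a column of its own after the filter step.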

Answer 2

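My approach is to create a unique identifier column, use that column to keep only the unique rows (saved as df2), and then fill in the missing values in df2 using the full data in df.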

library(dplyr)
#make a uniqueID column by pasting any of the relevant unique values together
df$uniqueID<-paste0(df$C_ID,df$OrgName, df$Program)
#remove duplicate rows, store as df2
df2<-df[!duplicated(df$uniqueID),]

The next step uses uniqueID as an index: for each row of df2, any() checks whether an X appears in df among the rows with the same uniqueID.
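A minimal sketch of that fill step, assuming character role columns that hold either "X" or an empty string, could look like the following. The stand-in data frame and the names role_cols and has_role are illustrative and do not come from the original answer.

```r
# Hypothetical stand-in for df, with character role columns
df <- data.frame(
  C_ID = c(10045L, 10045L, 10045L, 10046L, 10046L),
  OrgName = c("Acme", "Acme", "Acme", "Acme", "Acme"),
  Program = c("Buildings", "Buildings", "Homes", "Buildings", "Buildings"),
  Role_Primary   = c("X", "", "X", "X", ""),
  Role_Comms     = c("", "X", "", "", ""),
  Role_Signatory = c("", "", "", "", "X"),
  stringsAsFactors = FALSE
)

# Same key and de-duplication as described in the answer
df$uniqueID <- paste0(df$C_ID, df$OrgName, df$Program)
df2 <- df[!duplicated(df$uniqueID), ]

role_cols <- c("Role_Primary", "Role_Comms", "Role_Signatory")

# For every role column and every kept row, mark "X" if any row of df
# with the same uniqueID has an "X" in that column
for (col in role_cols) {
  has_role <- vapply(
    df2$uniqueID,
    function(id) any(df[[col]][df$uniqueID == id] == "X", na.rm = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
  df2[[col]] <- ifelse(has_role, "X", "")
}

print(df2)
```

If key parts could run together ambiguously, pasting with a separator, for example paste(df$C_ID, df$OrgName, df$Program, sep = "|"), avoids accidental collisions between different contacts.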

This gives you the desired output:

df2

Merging rows in R based on several parameters
