R - Efficiently find rows with nearly identical data and paste the differences into one cell

Time: 2015-08-11 00:44:01

Tags: r

Suppose I have a data frame:

 # Build the data frame column by column so the columns get proper names and types
 Data <- data.frame(
   Name   = rep(c("Paul", "David"), times = c(4, 3)),
   Age    = rep(c(26, 35), times = c(4, 3)),
   Weight = rep(c(150, 165), times = c(4, 3)),
   School = rep(c("Helgason U", "Turing College"), times = c(4, 3)),
   Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
              "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
              "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
   Author = c("John Lee", "Charles Dickens", "Hunter Thompson", "Thomas Pynchon",
              "Aldous Huxley", "Vashista", "Anonymous"),
   stringsAsFactors = FALSE
 )

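I want to condense the data so that all rows belonging to the same person collapse into a single row, with that person's books and authors concatenated. In other words, I want my output to be: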

    Name   Age  Weight  School          Books                           Authors
    Paul   26   150     Helgason U      Intro to Smooth Manifolds       John Lee
                                        A Tale of Two Cities            Charles Dickens
                                        Fear and Loathing in Las Vegas  Hunter Thompson
                                        Gravity's Rainbow               Thomas Pynchon
    David  35   165     Turing College  Brave New World                 Aldous Huxley
                                        Vashista's Yoga                 Vashista
                                        C++ For Dummies                 Anonymous

Ideally, the books would be concatenated as "Intro to Smooth Manifolds\nA Tale of Two Cities\nFear and Loathing in Las Vegas\nGravity's Rainbow".
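For reference, here is a minimal sketch of that target format, using only the first two titles. The variable names `books` and `joined` are illustrative and do not appear elsewhere in the question.

```r
# Two of the titles from the example data
books <- c("Intro to Smooth Manifolds", "A Tale of Two Cities")

# collapse = "\n" joins all elements into ONE string, separated by newlines
joined <- paste(books, collapse = "\n")

# cat() renders the embedded newline; print() would show it escaped as \n
cat(joined, "\n")

# The individual titles can be recovered later by splitting on the newline
titles <- strsplit(joined, "\n", fixed = TRUE)[[1]]
```

Note that `paste(..., sep = "\n")` would not do this: `sep` joins corresponding elements of several vectors, while `collapse` merges the elements of the result into one string.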

Initially I used a for loop, but it is too slow because my actual data is far larger than this. To give an idea of how I was looping:

  for (i in seq_len(L)){
    # Rows belonging to the i-th unique name
    Names = subset(Data, Data$Name == unique(Data$Name)[i])
    rows = nrow(Names)

    # Rows that share their Cols values with at least one other row
    Name_Matches = which(duplicated(Names[,Cols]) | duplicated(Names[nrow(Names):1, Cols])[nrow(Names):1])
    Name_UnMtchs = setdiff(1:nrow(Names), Name_Matches)

    Books        = Names$Book[Name_Matches]
    New_Books    = paste(as.character(Books), collapse = "\n")
    Authors      = Names$Author[Name_Matches]
    New_Authors  = paste(as.character(Authors), collapse = "\n")

    # Write one consolidated row per person
    New_Data[count_New, Cols]  = Names[Name_Matches[1], Cols]
    New_Data$Book[count_New]   = New_Books
    New_Data$Author[count_New] = New_Authors
    count_New                  = count_New + 1
  }

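Here Cols holds the indices of the columns that I know stay constant for a given person (Age, Weight, School, Name), L is the number of unique names in the data frame, count_New is a counter initialized to 1, and New_Data is an empty data frame with the same columns as Data. Which function can I use to consolidate my data without this kind of for loop?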

2 Answers:

Answer 0: (score: 3)

This kind of thing can be done in base R, but you are better off using a package designed specifically for data wrangling.
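As a rough illustration of the base R route, here is a sketch using `aggregate()`. It assumes a cut-down copy of the question's data with character columns; the names `Demo`, `key_cols` and `Combined` are made up for this example.

```r
# Assumption: a three-row stand-in for the question's Data, character columns
Demo <- data.frame(
  Name   = c("Paul", "Paul", "David"),
  Age    = c(26, 26, 35),
  Weight = c(150, 150, 165),
  School = c("Helgason U", "Helgason U", "Turing College"),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities", "Brave New World"),
  Author = c("John Lee", "Charles Dickens", "Aldous Huxley"),
  stringsAsFactors = FALSE
)

# Columns that identify a person
key_cols <- c("Name", "Age", "Weight", "School")

# For every distinct combination of the key columns, paste the Book and
# Author values together; collapse = "\n" is passed on to paste()
Combined <- aggregate(
  Demo[c("Book", "Author")],
  by  = Demo[key_cols],
  FUN = paste, collapse = "\n"
)
```

`aggregate()` sorts its output by the grouping columns, so the row order can differ from the input; look rows up by `Name` rather than by position.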

In dplyr:

library(dplyr)

# One row per person, with books and authors joined by newlines
Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books   = paste(Book, collapse = "\n"),
            Authors = paste(Author, collapse = "\n"))

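I suspect the following is closer to what you actually want, though. Rather than pasting the book titles (and authors) into one string per name, it stores them as a vector of titles per person (a list column), which can then be used for further processing.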

Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books=list(Book), Authors=list(Author))
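To give a feel for what "further processing" on per-person vectors might look like, here is a small sketch. It uses base R's `split()` instead of the dplyr list column, purely so that it runs without extra packages; `Demo` and `books_by_name` are hypothetical names and the data is a cut-down stand-in for the question's.

```r
# Assumption: a three-row stand-in for the question's Data
Demo <- data.frame(
  Name = c("Paul", "Paul", "David"),
  Book = c("Intro to Smooth Manifolds", "A Tale of Two Cities", "Brave New World"),
  stringsAsFactors = FALSE
)

# A named list with one character vector of titles per person
books_by_name <- split(Demo$Book, Demo$Name)

# Number of books per person
n_books <- lengths(books_by_name)

# Individual titles stay addressable, unlike in a pasted string
second_paul <- books_by_name[["Paul"]][2]

# The newline-joined form can still be produced on demand
joined <- vapply(books_by_name, paste, character(1), collapse = "\n")
```

Keeping the titles as vectors defers the choice of separator until display time, and counting or filtering titles needs no string parsing.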

Answer 1: (score: 1)

Consider this base R solution (although it is neither efficient nor elegant):

# OBTAIN UNIQUE PERSONS DATAFRAME
Data2 <- unique(Data[1:4])
rownames(Data2) <- NULL

# PRE-ALLOCATE THE CONCATENATED COLUMNS
Data2$Book <- NA_character_
Data2$Author <- NA_character_

# LOOP THROUGH DISTINCT PERSONS (ONE PASS PER NAME)
for (person in Data2$Name) {
  # BOOK COLUMN (PULL MATCHING ROWS, THEN CONCATENATE)
  books <- Data$Book[Data$Name == person]
  Data2$Book[Data2$Name == person] <- paste(books, collapse = "\n")

  # AUTHOR COLUMN (PULL MATCHING ROWS, THEN CONCATENATE)
  authors <- Data$Author[Data$Name == person]
  Data2$Author[Data2$Name == person] <- paste(authors, collapse = "\n")
}