Suppose I have a data frame:
# Build the data frame column by column so that the columns are properly
# named and Age/Weight stay numeric
Data <- data.frame(
  Name   = c(rep("Paul", 4), rep("David", 3)),
  Age    = c(rep(26, 4), rep(35, 3)),
  Weight = c(rep(150, 4), rep(165, 3)),
  School = c(rep("Helgason U", 4), rep("Turing College", 3)),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
             "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
             "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
  Author = c("John Lee", "Charles Dickens", "Hunter Thompson", "Thomas Pynchon",
             "Aldous Huxley", "Vashista", "Anonymous"),
  stringsAsFactors = FALSE
)
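I would like to condense the data so that all rows corresponding to the same person fit into one row, with the multiple books and authors concatenated. In other words, I want my output to be: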
Name   Age  Weight  School          Books                           Authors
Paul   26   150     Helgason U      Intro to Smooth Manifolds       John Lee
                                    A Tale of Two Cities            Charles Dickens
                                    Fear and Loathing in Las Vegas  Hunter Thompson
                                    Gravity's Rainbow               Thomas Pynchon
David  35   165     Turing College  Brave New World                 Aldous Huxley
                                    Vashista's Yoga                 Vashista
                                    C++ For Dummies                 Anonymous
Ideally, I would like the books to be concatenated as "Intro to Smooth Manifolds\nA Tale of Two Cities\nFear and Loathing in Las Vegas\nGravity's Rainbow".
Originally I used a for loop, but it is far too slow because my actual data is much larger than this. To give an idea of how I was looping:
for (i in 1:L){
  # Rows belonging to the i-th unique name
  Names = subset(Data, Data$Name == unique(Data$Name)[i])
  rows = nrow(Names)
  # Rows whose Cols values are duplicated elsewhere (scanning forwards or backwards)
  Name_Matches = which(duplicated(Names[,Cols]) | duplicated(Names[nrow(Names):1, Cols])[nrow(Names):1])
  Name_UnMtchs = setdiff(1:nrow(Names), Name_Matches)
  Books = Names$Book[Name_Matches]
  New_Books = paste(as.character(Books), collapse = "\n")
  Authors = Names$Author[Name_Matches]
  New_Authors = paste(Authors, collapse = "\n")
  New_Data[count_New, Cols] = Names[Name_Matches[1], Cols]
  # Write into the current row only, not the whole column
  New_Data$Book[count_New] = New_Books
  New_Data$Author[count_New] = New_Authors
  count_New = count_New + 1
}
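Here Cols holds the indices of the columns that I know stay constant for a given person (Age, Weight, School, Name), L is the number of unique names in the data frame, count_New is a counter initialized at 1, and New_Data is an empty data frame with the same columns as Data. What function can I use to consolidate my data without this kind of for loop?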
Answer 0 (score: 3)
This kind of thing can be done in base R, but you are better off using a package designed specifically for data wrangling.
With dplyr:
library(dplyr)
Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = paste(Book, collapse = "\n"), Authors = paste(Author, collapse = "\n"))
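I suspect the following is closer to what you really want. Instead of pasting the book titles (and authors) into one string per name, it keeps them as a vector of titles per person (a list column), which can then be used for further processing.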
Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = list(Book), Authors = list(Author))
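As a rough illustration of what "further processing" on a list column might look like, here is a minimal sketch. It assumes dplyr 1.0.0 or later (for the `.groups` argument) and uses a small hypothetical data frame `toy` rather than the asker's data; the names `by_person`, `n_books` and `Books_text` are invented for the example.

```r
library(dplyr)

# Hypothetical toy data standing in for the asker's Data
toy <- data.frame(
  Name = c("Paul", "Paul", "David"),
  Book = c("A Tale of Two Cities", "Gravity's Rainbow", "Brave New World"),
  stringsAsFactors = FALSE
)

# One row per person; Books is a list column holding one character vector per row
by_person <- toy %>%
  group_by(Name) %>%
  summarise(Books = list(Book), .groups = "drop")

# The vectors remain usable: count the titles per person, and collapse
# them into a single newline-separated string only when that is needed
by_person <- by_person %>%
  mutate(
    n_books    = lengths(Books),
    Books_text = vapply(Books, paste, character(1), collapse = "\n")
  )
```

Keeping the list column around and deriving the pasted string from it means the individual titles are not lost, which is the trade-off the answer is pointing at.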
Answer 1 (score: 1)
Consider this base R solution (though it is neither efficient nor elegant):
# OBTAIN UNIQUE PERSONS DATAFRAME
Data2 <- unique(Data[1:4])
rownames(Data2) <- NULL
# CREATE THE OUTPUT COLUMNS UP FRONT
Data2$Book <- NA_character_
Data2$Author <- NA_character_
# LOOP THROUGH DISTINCT PERSONS (ITERATE OVER THE NAMES THEMSELVES,
# SO THE LOOP WORKS FOR ANY NUMBER OF PERSONS)
for (p in Data2$Name) {
  # BOOK COLUMN (PULL THIS PERSON'S TITLES, THEN CONCATENATE)
  books <- Data$Book[Data$Name == p]
  Data2$Book[Data2$Name == p] <- paste(books, collapse = "\n")
  # AUTHOR COLUMN (PULL THIS PERSON'S AUTHORS, THEN CONCATENATE)
  authors <- Data$Author[Data$Name == p]
  Data2$Author[Data2$Name == p] <- paste(authors, collapse = "\n")
}
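For comparison, base R can also do the grouping without an explicit loop. The following is a minimal sketch using aggregate(), not part of the original answer; it runs on a small hypothetical data frame `toy`, and the name `collapsed` is invented for the example.

```r
# Hypothetical toy data standing in for the asker's Data
toy <- data.frame(
  Name   = c("Paul", "Paul", "David"),
  Age    = c(26, 26, 35),
  Book   = c("A Tale of Two Cities", "Gravity's Rainbow", "Brave New World"),
  Author = c("Charles Dickens", "Thomas Pynchon", "Aldous Huxley"),
  stringsAsFactors = FALSE
)

# Group by the columns that are constant per person and paste the
# remaining columns together, one newline-separated string per group
collapsed <- aggregate(
  toy[c("Book", "Author")],
  by = toy[c("Name", "Age")],
  FUN = paste, collapse = "\n"
)
```

With the asker's full data, the grouping columns would presumably be Name, Age, Weight and School. Whether this is fast enough for a large data set is something to measure rather than assume.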