My data looks something like this:
ID Language MotherTongue SpokenatHome HomeLang
1 English English English
1 French French
1 Polish Polish
2 Lebanese Lebanese Lebanese Labanese
2 Arabic Arabbic
This is the output I am looking for:
ID Language1 Language2 Language3 MotherTongue1 MotherTongue2 SpokenatHome1 HomeLan
1 English French Polish English Polish French English
2 Lebanese Arabic Labanese Arabic
I tried the melt and dcast functions from the reshape2 package, but it does not work. Does anyone know how to do this? Thanks.
df<-df[,c("OEN", "Langugae","MotherTongue", "SpokenatHome", "MainHomeLanguage")]
dfl <- melt(df, id.vars=c("OEN", "Langugae"), measure.vars=c("MotherTongue", "SpokenatHome", "MainHomeLanguage"),
variable.name="Language")
dfw <- dcast(dfl, OEN ~ Langugae , value.var="value" )
Answer 0 (score: 3)
Hi, you can try this (it relies on plyr to index the languages, though):
df <- read.table(text="ID Language
1 English
1 French
1 Spanish
1 Polish
2 English
2 French
3 Lebanese
3 Arabic", header=T)
# For creating an index of Language by ID (there is probably a better way to do this)
library(plyr)
df <- ddply(df, .(ID), mutate, ID2 = 1:length(ID))
# The same as above without using plyr (assumes the rows are already grouped by ID):
df$ID2 <- unlist(tapply(X = df$ID, INDEX = df$ID, FUN = function(x) 1:length(x)))
# And use reshape for doing what you want
reshape(data = df, timevar = "ID2", v.names = "Language", idvar = "ID", direction = "wide")
# ID Language.1 Language.2 Language.3 Language.4
#1 1 English French Spanish Polish
#5 2 English French <NA> <NA>
#7 3 Lebanese Arabic <NA> <NA>
The same for the second dataset:
df2 <- read.table(text="ID Language MotherTongue SpokenatHome HomeLang
1 English English NA English
1 French NA French NA
1 Polish Polish NA NA
2 Lebanese Lebanese Lebanese Labanese
2 Arabic NA NA Arabbic", header=TRUE)
df2 <- ddply(df2, .(ID), mutate, ID2 = 1:length(ID))
reshape(data = df2, timevar = "ID2", v.names = c("Language", "MotherTongue", "SpokenatHome", "HomeLang"), idvar = "ID", direction = "wide")
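As a side note, the per-ID index can also be built in base R with ave() and seq_along(), which does not depend on the rows being sorted by ID. This is only a minimal sketch, assuming the same column names as df2 above; the name wide is just for illustration:

```r
df2 <- read.table(text="ID Language MotherTongue SpokenatHome HomeLang
1 English English NA English
1 French NA French NA
1 Polish Polish NA NA
2 Lebanese Lebanese Lebanese Labanese
2 Arabic NA NA Arabbic", header=TRUE)

# Running counter within each ID (restarts at 1 for every new ID)
df2$ID2 <- ave(df2$ID, df2$ID, FUN = seq_along)

# One row per ID, one set of columns per value of ID2
wide <- reshape(data = df2, timevar = "ID2",
                v.names = c("Language", "MotherTongue", "SpokenatHome", "HomeLang"),
                idvar = "ID", direction = "wide")
wide
```

The result has one row per ID and columns named Language.1, MotherTongue.1, and so on, padded with NA where an ID has fewer rows.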
Answer 1 (score: 0)
Not very elegant:
df <- read.table(header = TRUE, as.is = TRUE, text = '
ID Language
1 English
1 French
1 Spanish
1 Polish
2 English
2 French
3 Lebanese
3 Arabic')
# split by ID
sp <- tapply(df$Language, df$ID, function(x) x)
# max length
mls <- max(sapply(sp, length))
# make same length
spNA <- lapply(sp, function(x) {
l <- length(x)
if(l == mls){
out <- x
} else {
out <- c(x, rep(NA, mls-l))
}
return(out)
}
)
# rbind
do.call(rbind, spNA)
# [,1] [,2] [,3] [,4]
# 1 "English" "French" "Spanish" "Polish"
# 2 "English" "French" NA NA
# 3 "Lebanese" "Arabic" NA NA
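The padding step can probably be shortened: assigning a longer length to a vector fills it with NA, so the replacement function `length<-` can do the work of the if/else. A rough sketch of the same idea, assuming the same df as above (res is just an illustrative name):

```r
df <- read.table(header = TRUE, as.is = TRUE, text = '
ID Language
1 English
1 French
1 Spanish
1 Polish
2 English
2 French
3 Lebanese
3 Arabic')

# split by ID
sp <- split(df$Language, df$ID)
# max length
mls <- max(lengths(sp))
# `length<-` pads each shorter vector with NA up to mls;
# sapply() returns one column per ID, so transpose to get one row per ID
res <- t(sapply(sp, `length<-`, mls))
res
```

This gives a character matrix with one row per ID, like the rbind version.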
Answer 2 (score: 0)
Here is a reshape2 solution. As with the other answers, I had to add a variable for the answer number within each ID, which I did using ddply from plyr.
df = read.table(text="ID Language MotherTongue SpokenatHome HomeLang
1 English English NA English
1 French NA French NA
1 Polish Polish NA NA
2 Lebanese Lebanese Lebanese Labanese
2 Arabic NA Arabbic NA", header=TRUE)
require(reshape2)
df1 = melt(df, id.vars=c("ID"), variable.name = "type")
require(plyr)
# Add in variable for number of unique answers per ID
df1 = ddply(df1, .(ID, type), mutate, num = 1:length(ID))
# Cast the dataset wide
df2 = dcast(df1, ID ~ type + num)
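This gives multiple columns for each category (Language, HomeLang, etc.). If you need to drop the columns that contain only NA, you can do the following (I found this approach elsewhere).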
df2[colSums(is.na(df2)) < nrow(df2)]
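To see what that last line does, here is a small illustration on a made-up data frame (toy and its column names are hypothetical, not the real data): colSums(is.na(...)) counts the NAs in each column, and a column is kept only if that count is smaller than the number of rows.

```r
# Toy data: column a_2 is entirely NA, a_1 is only partly NA
toy <- data.frame(ID  = 1:2,
                  a_1 = c("x", NA),
                  a_2 = c(NA, NA),
                  b_1 = c("y", "z"))

# TRUE for every column that has at least one non-NA value
keep <- colSums(is.na(toy)) < nrow(toy)

# Drops a_2 but keeps a_1, since a_1 still has one real value
toy[keep]
```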