Converting wide format to long format in R

Asked: 2014-06-02 13:47:49

Tags: r reshape2

My data is in a format similar to the following:

ID Language  MotherTongue  SpokenatHome   HomeLang
1    English   English                      English
1    French                   French        
1    Polish    Polish         
2    Lebanese  Lebanese        Lebanese    Labanese
2    Arabic                    Arabbic

This is the output I am looking for:

ID  Language1 Language2 Language3   MotherTongue1  MotherTongue2  SpokenatHome1 HomeLan
1   English    French     Polish     English         Polish        French       English
2   Lebanese   Arabic                Labanese                      Arabic

I am using melt and dcast from the reshape2 package, but it does not work. Does anyone know how to do this? Thanks.

df<-df[,c("OEN", "Langugae","MotherTongue", "SpokenatHome", "MainHomeLanguage")]
dfl <- melt(df, id.vars=c("OEN", "Langugae"), measure.vars=c("MotherTongue", "SpokenatHome", "MainHomeLanguage"),
            variable.name="Language")

dfw <- dcast(dfl, OEN ~  Langugae , value.var="value" )

3 Answers:

Answer 0: (score: 3)

Hi, you can try this (note that it relies on plyr to index the languages within each ID):

df <- read.table(text="ID Language
1    English
1    French
1    Spanish
1    Polish
2    English
2    French
3    Lebanese
3    Arabic", header=T)

# For creating an index of Language by ID (there is probably a better way to do this)
library(plyr)
df <- ddply(df, .(ID), mutate, ID2 = 1:length(ID))

# The same as above without using plyr :
df$ID2 <- unlist(tapply(X = df$ID, INDEX = df$ID, FUN = function(x) 1:length(x)))

# And use reshape for doing what you want
reshape(data = df, timevar = "ID2", v.names = "Language", idvar = "ID", direction = "wide")

#  ID Language.1 Language.2 Language.3 Language.4
#1  1    English     French    Spanish     Polish
#5  2    English     French       <NA>       <NA>
#7  3   Lebanese     Arabic       <NA>       <NA>
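
As a side note on the indexing step above, here is a minimal sketch of another base R way to build the per-ID counter, using ave() with seq_along. It is only an illustration on the same example data (the name wide is just a placeholder I picked), and unlike the unlist(tapply(...)) variant it should not depend on the rows already being sorted by ID:

```r
df <- read.table(text = "ID Language
1 English
1 French
1 Spanish
1 Polish
2 English
2 French
3 Lebanese
3 Arabic", header = TRUE)

# ave() runs seq_along() within each ID group and returns the
# counters in the original row order
df$ID2 <- ave(df$ID, df$ID, FUN = seq_along)

# Same reshape call as above; 'wide' is a placeholder name
wide <- reshape(data = df, timevar = "ID2", v.names = "Language",
                idvar = "ID", direction = "wide")
names(wide)
```
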

The same approach works for the second dataset:

df2 <- read.table(text="ID Language  MotherTongue  SpokenatHome   HomeLang
1    English   English            NA      English
1    French       NA           French        NA
1    Polish    Polish         NA           NA
2    Lebanese  Lebanese        Lebanese    Labanese
2    Arabic       NA    NA             Arabbic", header=TRUE)

df2 <- ddply(df2, .(ID), mutate, ID2 =  1:length(ID))
reshape(data = df2, timevar = "ID2", v.names = c("Language", "MotherTongue", "SpokenatHome", "HomeLang"), idvar = "ID", direction = "wide")

Answer 1: (score: 0)

Not very elegant:

df <- read.table(header = TRUE, as.is = TRUE, text = '
                 ID Language
1    English
1    French
1    Spanish
1    Polish
2    English
2    French
3    Lebanese
3    Arabic')


# split by ID
sp <- tapply(df$Language, df$ID, function(x) x)
# max length
mls <- max(sapply(sp, length))

# make same length
spNA <- lapply(sp, function(x) {
  l <- length(x)
  if(l == mls){
    out <- x
  } else {
    out <- c(x, rep(NA, mls-l))
  }
  return(out)
  }
)
# rbind
do.call(rbind, spNA)

# [,1]       [,2]     [,3]      [,4]    
# 1 "English"  "French" "Spanish" "Polish"
# 2 "English"  "French" NA        NA      
# 3 "Lebanese" "Arabic" NA        NA    
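
For what it is worth, the padding step can probably be shortened. The following is a rough sketch, assuming the same example data, that uses split() and the replacement function `length<-`, which extends a vector with NA up to the requested length. The name mat is just a placeholder:

```r
df <- read.table(header = TRUE, as.is = TRUE, text = "ID Language
1 English
1 French
1 Spanish
1 Polish
2 English
2 French
3 Lebanese
3 Arabic")

# split by ID
sp <- split(df$Language, df$ID)
# max length
mls <- max(lengths(sp))

# `length<-`(x, mls) pads each vector with NA up to mls elements,
# then rbind stacks the padded vectors into one matrix
mat <- do.call(rbind, lapply(sp, `length<-`, mls))
mat
```
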

Answer 2: (score: 0)

Here is a reshape2 solution. As in the other answers, I had to add a variable that numbers the answers within each ID, which I did using ddply from plyr.

df = read.table(text="ID Language  MotherTongue  SpokenatHome   HomeLang
1    English   English            NA      English
1    French       NA           French        NA
1    Polish    Polish         NA           NA
2    Lebanese  Lebanese        Lebanese    Labanese
2    Arabic       NA              Arabbic NA", header=TRUE)

require(reshape2)
df1 = melt(df, id.vars=c("ID"), variable.name = "type")

require(plyr)
# Add in variable for number of unique  answers per ID
df1 = ddply(df1, .(ID, type), mutate, num = 1:length(ID))
# Cast the dataset wide
df2 = dcast(df1, ID ~ type + num)

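This gives multiple columns for each category (Language, HomeLang, etc.). If you need to remove the columns that contain only NAs, you can do the following (a trick I found elsewhere).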

df2[colSums(is.na(df2)) < nrow(df2)]
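
To see what that last line does in isolation, here is a small sketch on a made-up data frame (toy and kept are hypothetical names, not part of the data above). A column is kept only if its NA count is smaller than the number of rows, so a column that is entirely NA gets dropped:

```r
# Hypothetical toy data: column b is entirely NA
toy <- data.frame(a = c("x", NA), b = c(NA, NA), c = c(NA, "y"))

# is.na() gives a logical matrix, colSums() counts the NAs per column
kept <- toy[colSums(is.na(toy)) < nrow(toy)]
names(kept)
```
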