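I have a list of variables that arrives as one long row in a data frame, and I would like to restructure these records into a more meaningful format.
My raw data looks like this: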
df <- data.frame(name1 = "John Doe", email1 = "John@Doe.com", phone1 = "(444) 444-4444", name2 = "Jane Doe", email2 = "Jane@Doe.com", phone2 = "(444) 444-4445", name3 = "John Smith", email3 = "John@Smith.com", phone3 = "(444) 444-4446", name4 = NA, email4 = "Jane@Smith.com", phone4 = NA, name5 = NA, email5 = NA, phone5 = NA)
df
# name1 email1 phone1 name2 email2 phone2
# 1 John Doe John@Doe.com (444) 444-4444 Jane Doe Jane@Doe.com (444) 444-4445
# name3 email3 phone3 name4 email4 phone4 name5
# 1 John Smith John@Smith.com (444) 444-4446 NA Jane@Smith.com NA NA
# email5 phone5
# 1 NA NA
I would like to turn it into a format like this:
df_transform <- structure(list(name = structure(c(2L, 1L, 3L, NA, NA), .Label = c("Jane Doe",
"John Doe", "John Smith"), class = "factor"), email = structure(c(3L,
1L, 4L, 2L, NA), .Label = c("Jane@Doe.com", "Jane@Smith.com",
"John@Doe.com", "John@Smith.com"), class = "factor"), phone = structure(c(1L,
2L, 3L, NA, NA), .Label = c("(444) 444-4444", "(444) 444-4445",
"(444) 444-4446"), class = "factor")), .Names = c("name", "email",
"phone"), class = "data.frame", row.names = c(NA, -5L))
df_transform
# name email phone
# 1 John Doe John@Doe.com (444) 444-4444
# 2 Jane Doe Jane@Doe.com (444) 444-4445
# 3 John Smith John@Smith.com (444) 444-4446
# 4 <NA> Jane@Smith.com <NA>
# 5 <NA> <NA> <NA>
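I should add that it is not always five records; it can be any number between 1 and 99. I tried using reshape2 and t(), but it quickly got too complicated. I imagine there is some well-known method that I am simply not aware of.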
Answer 0 (score: 3)
You are on the right track; try this:
library(reshape2)
# melt it down
df.melted = melt(t(df))
# get rid of the numbers at the end
df.melted$Var1 = sub('[0-9]+$', '', df.melted$Var1)
# cast it back
dcast(df.melted, (seq_len(nrow(df.melted)) - 1) %/% 3 ~ Var1)[,-1]
# email name phone
#1 John@Doe.com John Doe (444) 444-4444
#2 Jane@Doe.com Jane Doe (444) 444-4445
#3 John@Smith.com John Smith (444) 444-4446
#4 Jane@Smith.com <NA> <NA>
#5 <NA> <NA> <NA>
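The grouping term in the dcast formula may be easier to follow in isolation. The sketch below assumes, as in the question, three fields per record and columns laid out record by record; the names idx and record are illustrative only.

```r
# Row positions of the melted data: 5 records x 3 fields = 15 rows
idx <- seq_len(15)

# Integer division by the number of fields assigns one id per record
record <- (idx - 1) %/% 3
record
# 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4
```

If the number of fields is not three, the divisor would have to change accordingly, for example to the number of unique column names left after stripping the digits.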
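Answer 1 (score: 2)
1) reshape() First, we strip the digits from the column names, giving the reduced column names names0. We then split the column names into groups, producing g, which has three components corresponding to the email, name and phone column groups. Next we use reshape (from base R) to perform the wide-to-long transformation and select the desired columns from the resulting long data frame, which excludes the columns that reshape adds automatically. The selection vector unique(names0) also puts the resulting columns in the desired order.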
names0 <- sub("\\d+$", "", names(df))
g <- split(names(df), names0)
reshape(df, dir = "long", varying = g, v.names = names(g))[unique(names0)]
The last line gives:
name email phone
1.1 John Doe John@Doe.com (444) 444-4444
1.2 Jane Doe Jane@Doe.com (444) 444-4445
1.3 John Smith John@Smith.com (444) 444-4446
1.4 <NA> Jane@Smith.com <NA>
1.5 <NA> <NA> <NA>
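To see what is passed to reshape, it can help to print the intermediate objects. This is a minimal sketch on a cut-down, two-record version of the question's data (the shortened df is an assumption made here for brevity).

```r
# Two-record version of the wide data from the question
df <- data.frame(name1 = "John Doe", email1 = "John@Doe.com", phone1 = "(444) 444-4444",
                 name2 = "Jane Doe", email2 = "Jane@Doe.com", phone2 = "(444) 444-4445")

names0 <- sub("\\d+$", "", names(df))  # column names without the numeric suffix
g <- split(names(df), names0)          # one group of column names per field

names0
# "name" "email" "phone" "name" "email" "phone"
names(g)
# "email" "name" "phone"   (split orders the groups alphabetically)
g$name
# "name1" "name2"
unique(names0)
# "name" "email" "phone"   (original column order, used for the final selection)
```

Because split returns the groups in alphabetical order, indexing the result by unique(names0) is what restores the name, email, phone order.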
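2) reshape2 package Here is a solution using reshape2. We add a rowname column to df and melt it to long format. Then we split the variable column into the name part (name, email, phone) and the numeric suffix part, which we call id. Finally, we use dcast to convert it back to wide format and select the appropriate columns as before.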
library(reshape2)
m <- melt(data.frame(rowname = 1:nrow(df), df), id = 1)
mt <- transform(m,
variable = sub("\\d+$", "", variable),
id = as.numeric(sub("^\\D+", "", variable)) # numeric, so that record 10 sorts after record 9
)
dcast(mt, rowname + id ~ variable)[, unique(mt$variable)]
The last line gives:
name email phone
1 John Doe John@Doe.com (444) 444-4444
2 Jane Doe Jane@Doe.com (444) 444-4445
3 John Smith John@Smith.com (444) 444-4446
4 <NA> Jane@Smith.com <NA>
5 <NA> <NA> <NA>
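The two regular expressions used to split the variable column can be tried on their own. The vector below is a made-up sample of column names, including a two-digit suffix.

```r
# Hypothetical sample of wide column names
variable <- c("name1", "email1", "phone1", "name12", "email12", "phone12")

# Drop the trailing digits to get the field name
sub("\\d+$", "", variable)
# "name" "email" "phone" "name" "email" "phone"

# Drop the leading non-digits to get the record number (still as text)
sub("^\\D+", "", variable)
# "1" "1" "1" "12" "12" "12"
```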
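3) Simple matrix reshaping. Remove the numeric suffixes from the column names and set cn to the unique remaining names (cn stands for column names). Then we simply reshape the single row of df into an n x length(cn) matrix and add the column names. This relies on the columns being ordered record by record, as they are in the question.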
cn <- unique(sub("\\d+$", "", names(df)))
matrix(as.matrix(df), nc = length(cn), byrow = TRUE, dimnames = list(NULL, cn))
name email phone
[1,] "John Doe" "John@Doe.com" "(444) 444-4444"
[2,] "Jane Doe" "Jane@Doe.com" "(444) 444-4445"
[3,] "John Smith" "John@Smith.com" "(444) 444-4446"
[4,] NA "Jane@Smith.com" NA
[5,] NA NA NA
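The result above is a character matrix rather than a data frame. If a data frame is wanted, one possible follow-up step is as.data.frame; the sketch below uses a shortened two-record df and the illustrative names m and out, none of which appear in the original answer.

```r
# Two-record version of the wide data, with a missing name and phone
df <- data.frame(name1 = "John Doe", email1 = "John@Doe.com", phone1 = "(444) 444-4444",
                 name2 = NA, email2 = "Jane@Smith.com", phone2 = NA)

cn <- unique(sub("\\d+$", "", names(df)))

# Fill the matrix row by row: one row per record, one column per field
m <- matrix(as.matrix(df), ncol = length(cn), byrow = TRUE,
            dimnames = list(NULL, cn))

# Convert to a data frame, keeping the columns as character vectors
out <- as.data.frame(m, stringsAsFactors = FALSE)
str(out)
```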
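4) tapply This problem can also be solved with a simple tapply. As before, names0 is the column names without the numeric suffixes, and names.suffix holds just the suffixes. Now use tapply: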
names0 <- sub("\\d+$", "", names(df))
names.suffix <- as.numeric(sub("^\\D+", "", names(df))) # numeric, so that rows sort 1, 2, ..., 10 rather than 1, 10, 2
tapply(as.matrix(df), list(names.suffix, names0), c)[, unique(names0)]
The last line gives:
name email phone
1 "John Doe" "John@Doe.com" "(444) 444-4444"
2 "Jane Doe" "Jane@Doe.com" "(444) 444-4445"
3 "John Smith" "John@Smith.com" "(444) 444-4446"
4 NA "Jane@Smith.com" NA
5 NA NA NA
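Since the question allows up to 99 records, it is worth checking how the suffixes are ordered when they serve as grouping keys. Suffixes extracted with sub are text, and text keys are sorted lexicographically; the small check below (with made-up suffix values) shows the difference from numeric keys.

```r
# Suffixes as text, as returned by sub()
suffix_chr <- c("1", "2", "10")
levels(factor(suffix_chr))
# "1" "10" "2"   (lexicographic order)

# The same suffixes converted to numbers first
suffix_num <- as.numeric(suffix_chr)
levels(factor(suffix_num))
# "1" "2" "10"   (numeric order)
```

With nine records or fewer the two orderings coincide, so the difference only shows up from the tenth record onwards.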