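I would like to know how to rearrange a source data table so that R or SQL produces the desired output table shown below.
Loops in R are very slow and my dataset is very large, so I would rather avoid heavy looping in the script. Efficiency matters.
Source data table: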
Date     | Country | ID | Fruit  | Favorite | Money
20120101 | US      | 1  | Apple  | Book     | 100
20120101 | US      | 2  | Orange | Knife    | 150
20120101 | US      | 3  | Banana | Watch    | 80
20120101 | US      | 4  | Melon  | Water    | 90
20120102 | US      | 1  | Apple  | Phone    | 120
20120102 | US      | 2  | Apple  | Knife    | 130
20120102 | US      | 3  | Banana | Watch    | 100
.....    | ......  | .. | .....  | ......   | ......
Output table:
Date     | Country | Field    | ID 1  | ID 2   | ID 3   | ID 4
20120101 | US      | Fruit    | Apple | Orange | Banana | Melon
20120101 | US      | Favorite | Book  | Knife  | Watch  | Water
20120101 | US      | Money    | 100   | 150    | 80     | 90
20120102 | US      | Fruit    | Apple | Apple  | Banana | N.A.
....     | ....    | ....     | ....  | ....   | ....   | ....
Answer 0 (score: 0)
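Here is one approach in R, using your sample data:
# Stack the three value columns into long form (values + ind), keeping the key columns
x <- cbind(mydf[, c("Date", "Country", "ID")],
           stack(mydf[, c("Fruit", "Favorite", "Money")]))
# Spread the IDs into columns: one row per Date/Country/field
reshape(x, direction = "wide", idvar = c("Date", "Country", "ind"), timevar = "ID")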
#        Date Country      ind values.1 values.2 values.3 values.4
# 1  20120101      US    Fruit    Apple   Orange   Banana    Melon
# 5  20120102      US    Fruit    Apple    Apple   Banana     <NA>
# 8  20120101      US Favorite     Book    Knife    Watch    Water
# 12 20120102      US Favorite    Phone    Knife    Watch     <NA>
# 15 20120101      US    Money      100      150       80       90
# 19 20120102      US    Money      120      130      100     <NA>
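The reshape() result uses column names such as values.1 and ind, whereas the requested table has ID 1 to ID 4 and a field column. A minimal sketch of that clean-up follows, assuming the same sample data. The names wide, out and Field are illustrative choices, not part of the original answer.

```r
# Sample data with the same values as mydf in this answer
# (rebuilt here so the block runs on its own)
mydf <- data.frame(
  Date     = c(20120101L, 20120101L, 20120101L, 20120101L,
               20120102L, 20120102L, 20120102L),
  Country  = "US",
  ID       = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
  Fruit    = c("Apple", "Orange", "Banana", "Melon", "Apple", "Apple", "Banana"),
  Favorite = c("Book", "Knife", "Watch", "Water", "Phone", "Knife", "Watch"),
  Money    = c(100L, 150L, 80L, 90L, 120L, 130L, 100L),
  stringsAsFactors = FALSE
)

x <- cbind(mydf[, c("Date", "Country", "ID")],
           stack(mydf[, c("Fruit", "Favorite", "Money")]))
wide <- reshape(x, direction = "wide",
                idvar = c("Date", "Country", "ind"), timevar = "ID")

# Rename to match the requested layout (assumed naming: "Field", "ID 1", ...)
names(wide) <- sub("^values\\.", "ID ", names(wide))
names(wide)[names(wide) == "ind"] <- "Field"

# Order by date, then by field (ind is a factor, so this sorts by its level order)
out <- wide[order(wide$Date, wide$Field), ]
rownames(out) <- NULL
out
```

All value columns are character here, because stack() combines the text columns with Money into a single vector.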
For comparison with other options, here are the melt + dcast approach (available from "data.table" or "reshape2") and the "dplyr" + "tidyr" approach:
library(data.table)
# melt() warns because the measure columns have mixed types; the values become character
dcast(
  suppressWarnings(
    melt(as.data.table(mydf), c("Date", "Country", "ID"))),
  ... ~ ID, value.var = "value")
#        Date Country variable     1      2      3     4
# 1: 20120101      US    Fruit Apple Orange Banana Melon
# 2: 20120101      US Favorite  Book  Knife  Watch Water
# 3: 20120101      US    Money   100    150     80    90
# 4: 20120102      US    Fruit Apple  Apple Banana    NA
# 5: 20120102      US Favorite Phone  Knife  Watch    NA
# 6: 20120102      US    Money   120    130    100    NA
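Rather than silencing the warning, the type mismatch can be removed before melting. The following is a sketch under that assumption: Money is converted to character on a copy of the data, and the names dt, long, wide and Field are illustrative, not from the original answer.

```r
library(data.table)

# Sample data with the same values as mydf in this answer
mydf <- data.frame(
  Date     = c(20120101L, 20120101L, 20120101L, 20120101L,
               20120102L, 20120102L, 20120102L),
  Country  = "US",
  ID       = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
  Fruit    = c("Apple", "Orange", "Banana", "Melon", "Apple", "Apple", "Banana"),
  Favorite = c("Book", "Knife", "Watch", "Water", "Phone", "Knife", "Watch"),
  Money    = c(100L, 150L, 80L, 90L, 120L, 130L, 100L),
  stringsAsFactors = FALSE
)

dt <- as.data.table(mydf)            # as.data.table() copies, so mydf is left untouched
dt[, Money := as.character(Money)]   # all measure columns are now character

# Long form: one row per Date/Country/ID/field
long <- melt(dt, id.vars = c("Date", "Country", "ID"),
             variable.name = "Field", value.name = "value")

# Wide form: one column per ID, missing combinations filled with NA
wide <- dcast(long, Date + Country + Field ~ ID, value.var = "value")
wide
```

Given the concern about loops on a large table, this route is worth trying first: melt() and dcast() in data.table involve no explicit R-level looping.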
library(dplyr)
library(tidyr)
mydf %>%
  gather(variable, value, Fruit:Money) %>%   # value columns -> long key/value pairs
  spread(ID, value)                          # one column per ID
#       Date Country variable     1      2      3     4
# 1 20120101      US    Fruit Apple Orange Banana Melon
# 2 20120101      US Favorite  Book  Knife  Watch Water
# 3 20120101      US    Money   100    150     80    90
# 4 20120102      US    Fruit Apple  Apple Banana  <NA>
# 5 20120102      US Favorite Phone  Knife  Watch  <NA>
# 6 20120102      US    Money   120    130    100  <NA>
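In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A rough equivalent is sketched below, assuming tidyr 1.1.0 or later (for values_transform). The Field column name and the "ID " prefix are illustrative choices made to match the requested output.

```r
library(tidyr)

# Sample data with the same values as mydf in this answer
mydf <- data.frame(
  Date     = c(20120101L, 20120101L, 20120101L, 20120101L,
               20120102L, 20120102L, 20120102L),
  Country  = "US",
  ID       = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
  Fruit    = c("Apple", "Orange", "Banana", "Melon", "Apple", "Apple", "Banana"),
  Favorite = c("Book", "Knife", "Watch", "Water", "Phone", "Knife", "Watch"),
  Money    = c(100L, 150L, 80L, 90L, 120L, 130L, 100L),
  stringsAsFactors = FALSE
)

long <- pivot_longer(
  mydf,
  cols = Fruit:Money,
  names_to = "Field",
  values_to = "value",
  # Fruit and Favorite are character while Money is integer,
  # so convert everything to character before combining
  values_transform = list(value = as.character)
)

# One column per ID; Date, Country and Field identify each output row
wide <- pivot_wider(long, names_from = ID, values_from = value,
                    names_prefix = "ID ")
wide
```

Unlike gather(), pivot_longer() refuses to combine columns of different types silently, which is why the explicit conversion is needed here.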
In this answer, mydf is defined as:
mydf <- structure(
  list(Date = c(20120101L, 20120101L, 20120101L,
                20120101L, 20120102L, 20120102L, 20120102L),
       Country = c("US", "US", "US", "US", "US", "US", "US"),
       ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
       Fruit = c("Apple", "Orange", "Banana", "Melon",
                 "Apple", "Apple", "Banana"),
       Favorite = c("Book", "Knife", "Watch", "Water",
                    "Phone", "Knife", "Watch"),
       Money = c(100L, 150L, 80L, 90L, 120L, 130L, 100L)),
  .Names = c("Date", "Country", "ID",
             "Fruit", "Favorite", "Money"),
  class = "data.frame", row.names = c(NA, -7L))