Rearranging data with R / SQL without loops

Time: 2013-07-18 09:02:39

Tags: sql r dataset reshape large-data

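I would like to know how to rearrange my source data (table) into the desired output table shown below, using R or SQL.

Loops in R are very slow and my dataset is very large, so I would rather avoid heavy looping in the script. Efficiency matters.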

Source table:

Date      Country   ID   Fruit     Favorite  Money
20120101  US        1    Apple     Book      100
20120101  US        2    Orange    Knife     150
20120101  US        3    Banana    Watch     80
20120101  US        4    Melon     Water     90
20120102  US        1    Apple     Phone     120
20120102  US        2    Apple     Knife     130
20120102  US        3    Banana    Watch     100           
.....     ......    ..   .....     ......    ......

Output table:

Date      Country   Field     ID 1   ID 2    ID 3    ID 4
20120101  US        Fruit     Apple  Orange  Banana  Melon
20120101  US        Favorite  Book   Knife   Watch   Water
20120101  US        Money     100    150     80      90
20120102  US        Fruit     Apple  Apple   Banana  N.A.
....      ....      ....      ....   ....    ....    ....

1 answer:

Answer 0: (score: 0)

Here is one way to do it in R, using your sample data:

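# Stack the three value columns into long form (columns "values" and "ind"),
# keeping Date, Country and ID alongside each stacked value.
x <- cbind(mydf[, c("Date", "Country", "ID")], 
           stack(mydf[, c("Fruit", "Favorite", "Money")]))
# Spread back to wide form with one "values.<ID>" column per ID.
reshape(x, direction = "wide", idvar = c("Date", "Country", "ind"), timevar = "ID")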
#        Date Country      ind values.1 values.2 values.3 values.4
# 1  20120101      US    Fruit    Apple   Orange   Banana    Melon
# 5  20120102      US    Fruit    Apple    Apple   Banana     <NA>
# 8  20120101      US Favorite     Book    Knife    Watch    Water
# 12 20120102      US Favorite    Phone    Knife    Watch     <NA>
# 15 20120101      US    Money      100      150       80       90
# 19 20120102      US    Money      120      130      100     <NA>
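The base R result above is grouped by field and uses `values.1`, `values.2`, ... as column names, while the question asks for `ID 1`, `ID 2`, ... with rows grouped by date. A minimal sketch of that clean-up follows, assuming the same sample data; the names `fields` and `wide` are only illustrative and are not part of the original answer.

```r
# Sample data from the question (same values as mydf in this answer).
mydf <- data.frame(
  Date     = c(20120101L, 20120101L, 20120101L, 20120101L,
               20120102L, 20120102L, 20120102L),
  Country  = "US",
  ID       = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
  Fruit    = c("Apple", "Orange", "Banana", "Melon",
               "Apple", "Apple", "Banana"),
  Favorite = c("Book", "Knife", "Watch", "Water",
               "Phone", "Knife", "Watch"),
  Money    = c(100L, 150L, 80L, 90L, 120L, 130L, 100L),
  stringsAsFactors = FALSE
)

# Illustrative helper: the value columns, in the order the question lists them.
fields <- c("Fruit", "Favorite", "Money")

# Same stack() + reshape() steps as in the answer.
x <- cbind(mydf[, c("Date", "Country", "ID")], stack(mydf[, fields]))
wide <- reshape(x, direction = "wide",
                idvar = c("Date", "Country", "ind"), timevar = "ID")

# Rename the columns to the headers used in the question.
names(wide) <- sub("^values\\.", "ID ", names(wide))
names(wide)[names(wide) == "ind"] <- "Field"

# Order rows by Date, then by the original field order, and reset row names.
wide <- wide[order(wide$Date, match(wide$Field, fields)), ]
rownames(wide) <- NULL
wide
```

Everything here is vectorised, so no explicit loop is involved. Because `stack()` puts text and numbers into one column, the `Money` values end up as character strings in the result.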

For comparison, here are two alternatives: the melt + dcast approach (available from either "data.table" or "reshape2") and the "dplyr" + "tidyr" approach.

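library(data.table)
# melt() warns that the measure columns have mixed types and coerces them
# to character, which is what we want here, so the warning is suppressed.
dcast(
  suppressWarnings(
    melt(as.data.table(mydf), id.vars = c("Date", "Country", "ID"))),
  ... ~ ID, value.var = "value")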
#        Date Country variable     1      2      3     4
# 1: 20120101      US    Fruit Apple Orange Banana Melon
# 2: 20120101      US Favorite  Book  Knife  Watch Water
# 3: 20120101      US    Money   100    150     80    90
# 4: 20120102      US    Fruit Apple  Apple Banana    NA
# 5: 20120102      US Favorite Phone  Knife  Watch    NA
# 6: 20120102      US    Money   120    130    100    NA  

library(dplyr)
library(tidyr)
mydf %>%
  gather(variable, value, Fruit:Money) %>%
  spread(ID, value)
#       Date Country variable     1      2      3     4
# 1 20120101      US    Fruit Apple Orange Banana Melon
# 2 20120101      US Favorite  Book  Knife  Watch Water
# 3 20120101      US    Money   100    150     80    90
# 4 20120102      US    Fruit Apple  Apple Banana  <NA>
# 5 20120102      US Favorite Phone  Knife  Watch  <NA>
# 6 20120102      US    Money   120    130    100  <NA>
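In current tidyr releases `gather()` and `spread()` are marked as superseded in favour of `pivot_longer()` and `pivot_wider()`. The sketch below shows how the same reshape might look with those functions, assuming tidyr 1.1.0 or later (for `values_transform`). The column name `Field` and the `"ID "` prefix are choices made here to match the question's layout, not part of the original answer.

```r
library(tidyr)

# Sample data from the question (same values as mydf in this answer).
mydf <- data.frame(
  Date     = c(20120101L, 20120101L, 20120101L, 20120101L,
               20120102L, 20120102L, 20120102L),
  Country  = "US",
  ID       = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
  Fruit    = c("Apple", "Orange", "Banana", "Melon",
               "Apple", "Apple", "Banana"),
  Favorite = c("Book", "Knife", "Watch", "Water",
               "Phone", "Knife", "Watch"),
  Money    = c(100L, 150L, 80L, 90L, 120L, 130L, 100L),
  stringsAsFactors = FALSE
)

# Long form: one row per Date/Country/ID/Field combination.
# The value columns mix character and integer, so every value is
# converted to character before being combined into one column.
long <- pivot_longer(
  mydf,
  cols = c(Fruit, Favorite, Money),
  names_to = "Field",
  values_to = "value",
  values_transform = list(value = as.character)
)

# Wide form: one column per ID, named "ID 1", "ID 2", ...
wide <- pivot_wider(
  long,
  names_from = ID,
  values_from = value,
  names_prefix = "ID "
)
wide
```

IDs that are missing for a given date (ID 4 on 20120102 in the sample) come out as `NA`, as in the outputs above.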

In this answer, mydf is defined as:

mydf <- structure(
  list(Date = c(20120101L, 20120101L, 20120101L, 
                20120101L, 20120102L, 20120102L, 20120102L), 
       Country = c("US", "US", "US", "US", "US", "US", "US"), 
       ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
       Fruit = c("Apple", "Orange", "Banana", "Melon", 
                 "Apple", "Apple", "Banana"), 
       Favorite = c("Book", "Knife", "Watch", "Water", 
                    "Phone", "Knife", "Watch"), 
       Money = c(100L, 150L, 80L, 90L, 120L, 130L, 100L)), 
  .Names = c("Date", "Country", "ID", 
             "Fruit", "Favorite", "Money"), 
  class = "data.frame", row.names = c(NA, -7L))