Question

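I have a csv containing multiple tables, with variables stored in both rows and columns. About this csv:

I want to go from "wide" to "long"
It is a single csv
Each "data frame" has different types of variables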

> df3
     V1          V2    V3     V4      V5     V6      V7    V8
1   nyc 123 main st month      1       2      3       4     5
2   nyc 123 main st     x  58568  567567 567909   35876 56943
3   nyc 123 main st     y   5345    3673   3453    3467   788
4   nyc 123 main st     z  53223  563894 564456   32409 56155
5                                                            
6    la  63 main st month      1       2      3       4     5
7    la  63 main st     a  87035 7467456   3363     863 43673
8    la  63 main st     b    345     456    345     678   345
9    la  63 main st     c  86690 7467000   3018     185 43328
10                                                           
11   sf 953 main st month      1       2      3       4     5
12   sf 953 main st     x 457456    3455 345345   56457  3634
13   sf 953 main st     b   5345    3673   3453    3467   788
14   sf 953 main st     z 452111    -218 341892   52990  2846

> df4
18 city     address month      x       y      z       a     b       c
19  nyc 123 main st     1  58568    5345  53223    null  null    null
20  nyc 123 main st     2 567567    3673 563894    null  null    null
21  nyc 123 main st     3 567909    3453 564456    null  null    null
22  nyc 123 main st     4  35876    3467  32409    null  null    null
23  nyc 123 main st     5  56943     788  56155    null  null    null
24   la  63 main st     1   null    null   null   87035   345   86690
25   la  63 main st     2   null    null   null 7467456   456 7467000
26   la  63 main st     3   null    null   null    3363   345    3018
27   la  63 main st     4   null    null   null     863   678     185
28   la  63 main st     5   null    null   null   43673   345   43328
29   sf 953 main st     1 457456    null 452111    null  5345    null
30   sf 953 main st     2   3455    null   -218    null  3673    null
31   sf 953 main st     3 345345    null 341892    null  3453    null
32   sf 953 main st     4  56457    null  52990    null  3467    null
33   sf 953 main st     5   3634    null   2846    null   788    null

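The top is the data I have, and the bottom is the transformation I want.

I am most comfortable with R, but I am practicing Python, so any approach works.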

Answer 1

It helps if your df has proper column names to begin with, so assign column names after reading in the data.
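As a minimal sketch of that renaming step in base R, assuming the file was read without a header so that the columns arrive as V1, V2, and so on (the data frame `raw` below is a made-up stand-in, not the OP's file):

```r
# Hypothetical stand-in for a freshly read csv with default column names
raw <- data.frame(
  V1 = c("nyc", "nyc"),
  V2 = c("123 main st", "123 main st"),
  V3 = c("x", "y"),
  V4 = c(58568L, 5345L),
  V5 = c(567567L, 3673L),
  stringsAsFactors = FALSE
)

# Give the three identifier columns meaningful names;
# the remaining columns keep their default names
names(raw)[1:3] <- c("city", "address", "month")
```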

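I used the dplyr, tidyr, and stringr libraries for this analysis (gather() and spread() come from tidyr), and renamed the first 3 columns: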

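library(dplyr)
library(tidyr)    # provides gather() and spread()
library(stringr)  # provides str_sub()

df <- data.frame(stringsAsFactors=FALSE,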
        city = c("nyc", "nyc", "nyc"),
     address = c("123 main st", "123 main st", "123 main st"),
       month = c("x", "y", "z"),
          X1 = c(58568L, 5345L, 53223L),
          X2 = c(567567L, 3673L, 563894L),
          X3 = c(567909L, 3453L, 564456L),
          X4 = c(35876L, 3467L, 32409L),
          X5 = c(56943L, 788L, 56155L)
)

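df %>% gather(Type, Value, -c(city:month)) %>%    # stack X1..X5 into Type/Value pairs
        spread(month, Value) %>%                  # one column per variable x, y, z
        mutate(month = str_sub(Type, 2, 2)) %>%   # "X1" -> "1"
        select(-Type) %>%
        select(c(city, address, month, x:z))      # restore the desired column order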

  city     address month      x    y      z
1  nyc 123 main st     1  58568 5345  53223
2  nyc 123 main st     2 567567 3673 563894
3  nyc 123 main st     3 567909 3453 564456
4  nyc 123 main st     4  35876 3467  32409
5  nyc 123 main st     5  56943  788  56155
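
The example above covers a single city. As a rough sketch of how the same idea could be extended to every block in df3, the base R code below splits the data on the blank separator rows, reshapes each block, and binds the results together. This is an alternative under stated assumptions, not the approach used in this answer: `reshape_block` is a hypothetical helper, and the shortened `df3` is a stand-in for the OP's data with all columns read as character.

```r
# Small stand-in for df3 from the question (two cities, three months)
df3 <- data.frame(
  V1 = c("nyc", "nyc", "nyc", "", "la", "la", "la"),
  V2 = c(rep("123 main st", 3), "", rep("63 main st", 3)),
  V3 = c("month", "x", "y", "", "month", "a", "b"),
  V4 = c("1", "58568", "5345", "", "1", "87035", "345"),
  V5 = c("2", "567567", "3673", "", "2", "7467456", "456"),
  V6 = c("3", "567909", "3453", "", "3", "3363", "345"),
  stringsAsFactors = FALSE
)

# Blank separator rows start a new block; cumsum() turns them into block ids
blank <- df3$V1 == ""
blocks <- split(df3[!blank, ], cumsum(blank)[!blank])

# Reshape one block: the "month" row supplies the month values,
# every other row becomes a column named after its V3 entry
reshape_block <- function(b) {
  value_cols <- setdiff(names(b), c("V1", "V2", "V3"))
  is_header <- b$V3 == "month"
  out <- data.frame(
    city = b$V1[1],
    address = b$V2[1],
    month = as.integer(unlist(b[is_header, value_cols])),
    stringsAsFactors = FALSE
  )
  for (i in which(!is_header)) {
    out[[b$V3[i]]] <- as.integer(unlist(b[i, value_cols]))
  }
  out
}

long_blocks <- lapply(blocks, reshape_block)

# Row-bind blocks with different variables, filling the gaps with NA
all_vars <- unique(unlist(lapply(long_blocks, names)))
result <- do.call(rbind, lapply(long_blocks, function(d) {
  for (v in setdiff(all_vars, names(d))) d[[v]] <- NA_integer_
  d[all_vars]
}))
rownames(result) <- NULL
```

Each block is reshaped independently, so this sketch does not require the blocks to share the same variables; it does assume that every block has a "month" row and that blank rows separate the blocks.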

Answer 2

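The sample dataset provided by the OP suggests that all data frames within the csv file

have the same structure, i.e., the same number, names, and positions of columns, and
that the month columns V4 to V8 refer to the same months 1 to 5 in all "sub frames".

If this is true, we can treat the whole csv file as one data frame and convert it to the desired format using melt() and dcast() from the data.table package: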

library(data.table)
setDT(df3)[, melt(.SD, id.vars = paste0("V", 1:3), na.rm = TRUE)][
  V3 != "month", dcast(.SD, V1 + V2 + rleid(variable) ~ forcats::fct_inorder(V3))][
    , setnames(.SD, 1:3, c("city", "address", "month"))]

    city     address month      x    y      z       a    b       c
 1:   la  63 main st     1     NA   NA     NA   87035  345   86690
 2:   la  63 main st     2     NA   NA     NA 7467456  456 7467000
 3:   la  63 main st     3     NA   NA     NA    3363  345    3018
 4:   la  63 main st     4     NA   NA     NA     863  678     185
 5:   la  63 main st     5     NA   NA     NA   43673  345   43328
 6:  nyc 123 main st     1  58568 5345  53223      NA   NA      NA
 7:  nyc 123 main st     2 567567 3673 563894      NA   NA      NA
 8:  nyc 123 main st     3 567909 3453 564456      NA   NA      NA
 9:  nyc 123 main st     4  35876 3467  32409      NA   NA      NA
10:  nyc 123 main st     5  56943  788  56155      NA   NA      NA
11:   sf 953 main st     1 457456   NA 452111      NA 5345      NA
12:   sf 953 main st     2   3455   NA   -218      NA 3673      NA
13:   sf 953 main st     3 345345   NA 341892      NA 3453      NA
14:   sf 953 main st     4  56457   NA  52990      NA 3467      NA
15:   sf 953 main st     5   3634   NA   2846      NA  788      NA

Here, the fct_inorder() function from Hadley's forcats package is used to order the columns by first appearance rather than alphabetically (a, b, c, x, y, z).

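Note that the cities are sorted alphabetically as well. If this matters (though I doubt it does), the original order can be preserved by using

forcats::fct_inorder(V1) + V2 + rleid(variable) ~ forcats::fct_inorder(V3)

as the dcast() formula.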
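
To illustrate what ordering by first appearance means, here is a small base R sketch. It swaps in `factor(x, levels = unique(x))` as a stand-in for fct_inorder(), so it mimics the effect without forcats and is not the code used in this answer; the vector `v3` is typed by hand from column V3 of the question's data.

```r
# Variable names in the order they appear in column V3 (header rows removed)
v3 <- c("x", "y", "z", "a", "b", "c", "x", "b", "z")

# Default factor levels are sorted alphabetically
alphabetical <- levels(factor(v3))

# Using unique() as the level set keeps the order of first appearance
first_seen <- levels(factor(v3, levels = unique(v3)))
```

Because dcast() creates the new columns in factor-level order, the level order chosen here determines the column order of the reshaped table.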

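Data

Unfortunately, the OP did not provide the result of dput(df3), which made it unnecessarily difficult to reproduce the dataset as printed in the question: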

df3 <- readr::read_table(
  "     V1          V2    V3     V4      V5     V6      V7    V8
  1   nyc 123 main st month      1       2      3       4     5
  2   nyc 123 main st     x  58568  567567 567909   35876 56943
  3   nyc 123 main st     y   5345    3673   3453    3467   788
  4   nyc 123 main st     z  53223  563894 564456   32409 56155
  5                                                            
  6    la  63 main st month      1       2      3       4     5
  7    la  63 main st     a  87035 7467456   3363     863 43673
  8    la  63 main st     b    345     456    345     678   345
  9    la  63 main st     c  86690 7467000   3018     185 43328
  10                                                           
  11   sf 953 main st month      1       2      3       4     5
  12   sf 953 main st     x 457456    3455 345345   56457  3634
  13   sf 953 main st     b   5345    3673   3453    3467   788
  14   sf 953 main st     z 452111    -218 341892   52990  2846"
)
library(data.table)
setDT(df3)[, V2 := paste(X3, V2)][, c("X1", "X3") := NULL]
setDF(df3)[]
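
For readers unfamiliar with dput(): it prints an R expression that rebuilds an object, which is why it is a common way to share sample data. A minimal sketch of the round trip, using a made-up data frame (`small` is hypothetical, not the OP's data):

```r
# Hypothetical sample data with character and integer columns
small <- data.frame(
  city = c("nyc", "la"),
  month = c(1L, 1L),
  x = c(58568L, NA),
  stringsAsFactors = FALSE
)

# dput() writes the object as R code; capture that code as text here
code <- capture.output(dput(small))

# Evaluating the captured code recreates the data frame
rebuilt <- eval(parse(text = code))
```

Pasting the dput() output into a question lets others recreate the data without guessing at column types or whitespace.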

Wide-to-long data table transformation with variables in columns and rows
