
Time: 2019-07-08 16:35:26

Tags: r tidyr


             2015    2015    2016    2016    2017    2017
Name         Height  Weight  Height  Weight  Height  Weight  
Alice        12      34      56      78      90      12      
Bob          55      55      55      55      55      55     


Name    Year    Height    Weight
Alice   2015    12        34
Alice   2016    56        78
Alice   2017    90        12
Bob     2015    55        55
Bob     2016    55        55
Bob     2017    55        55

If the year row did not exist, I can see how to create the data frame with read_excel from readxl and then reshape it with gather from the tidyverse, but I do not know how to do this with two header rows. The main problem I am running into is that a column can obviously only have one name, yet it seems I want two names for each column, at least temporarily. What is the best way to do this?

3 answers:

Answer 0: (score: 3)


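library(tidyverse) # for gather(), mutate() and spread()
library(magrittr) # for the two-way pipe %<>%

# Assumes dat was read with the year row as its column names
# Start by renaming your columns to include both the year and variable
# The use of '-' to separate the parts is for convenience in the regex below
names(dat)[2:ncol(dat)] <- paste(dat[1, 2:ncol(dat)],
                                 names(dat)[2:ncol(dat)],
                                 sep = "-")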
names(dat)[1] <- "Name"
names(dat) <- sub("__\\d+", "", names(dat))

# Drop the now useless first row
dat <- dat[2:nrow(dat), ]

# Transform the data
dat %<>%
  gather(key = var, value = val, -Name) %>%
  mutate(Year = sub("^.*?-", "", var),
         var = sub("-\\d+$", "", var)) %>%
  spread(key = var, value = val, convert = TRUE) # convert = TRUE turns the text values into numbers

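The pipeline that transforms the data works in this order. First, your intuition is right: we need gather to go from wide to long. Second, we create the Year variable and strip those digits from the temporary var variable. Finally, we use spread to split Height and Weight into separate columns. Because the second header row in the original data is text, every value was read as text, so we also convert the values to numbers in that step.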
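To see what the three regular expressions in the code above do, here is a minimal base R sketch. The example strings are made up for illustration; they assume that readxl de-duplicated the repeated year headers with a suffix such as __1, which is what the sub("__\\d+", ...) call suggests.

```r
# Hypothetical column names as they might look right after the paste() step
raw_names <- c("Height-2015", "Weight-2015__1", "Height-2016")

# Strip the "__<n>" suffix that marks a de-duplicated column name
var <- sub("__\\d+", "", raw_names)

# Drop everything up to and including the first "-" to keep the year
year_part <- sub("^.*?-", "", var)

# Drop the trailing "-<digits>" to keep the variable name
measure_part <- sub("-\\d+$", "", var)

print(var)
print(year_part)
print(measure_part)
```

Keeping the year and the variable name in one string, joined by a fixed separator, is what lets a column carry two names temporarily.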

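Answer 1: (score: 2)

This is tricky, but it is common when getting data out of Excel files. I pasted your data into an xlsx file and read it with readxl::read_excel, but for reproducibility I have also pasted the output of dput here. I set col_names = F so that the columns only get dummy names, which puts each of the two header levels in a row of its own, like this: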


library(dplyr)
library(tidyr)

# df <- readxl::read_excel("multicols.xlsx", col_names = F)
df <- structure(list(...1 = c(NA, "Name", "Alice", "Bob"),
                     ...2 = c("2015", "Height", "12", "55"),
                     ...3 = c("2015", "Weight", "34", "55"),
                     ...4 = c("2016", "Height", "56", "55"),
                     ...5 = c("2016", "Weight", "78", "55"),
                     ...6 = c("2017", "Height", "90", "55"),
                     ...7 = c("2017", "Weight", "12", "55")),
                row.names = c(NA, -4L),
                class = c("tbl_df", "tbl", "data.frame"))
df
#> # A tibble: 4 x 7
#>   ...1  ...2   ...3   ...4   ...5   ...6   ...7  
#>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 <NA>  2015   2015   2016   2016   2017   2017  
#> 2 Name  Height Weight Height Weight Height Weight
#> 3 Alice 12     34     56     78     90     12    
#> 4 Bob   55     55     55     55     55     55


(yrs <- df[1,])
#> # A tibble: 1 x 7
#>   ...1  ...2  ...3  ...4  ...5  ...6  ...7 
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>  2015  2015  2016  2016  2017  2017
(cols <- df[2,])
#> # A tibble: 1 x 7
#>   ...1  ...2   ...3   ...4   ...5   ...6   ...7  
#>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 Name  Height Weight Height Weight Height Weight


(clean_names <- stringr::str_remove(paste(cols, yrs, sep = "_"), "_NA"))
#> [1] "Name"        "Height_2015" "Weight_2015" "Height_2016" "Weight_2016"
#> [6] "Height_2017" "Weight_2017"


df %>%
  slice(-1:-2) %>%
  setNames(clean_names)
#> # A tibble: 2 x 7
#>   Name  Height_2015 Weight_2015 Height_2016 Weight_2016 Height_2017
#>   <chr> <chr>       <chr>       <chr>       <chr>       <chr>      
#> 1 Alice 12          34          56          78          90         
#> 2 Bob   55          55          55          55          55         
#> # … with 1 more variable: Weight_2017 <chr>


df %>%
  slice(-1:-2) %>%
  setNames(clean_names) %>%
  gather(key, value, -Name) %>%
  separate(key, into = c("measure", "year")) %>%
  spread(key = measure, value)
#> # A tibble: 6 x 4
#>   Name  year  Height Weight
#>   <chr> <chr> <chr>  <chr> 
#> 1 Alice 2015  12     34    
#> 2 Alice 2016  56     78    
#> 3 Alice 2017  90     12    
#> 4 Bob   2015  55     55    
#> 5 Bob   2016  55     55    
#> 6 Bob   2017  55     55
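
With tidyr 1.0.0 or later, the gather / separate / spread chain can probably be collapsed into a single pivot_longer() call. This is a sketch and not part of the original answer: the data frame wide is a hand-built stand-in for the renamed data above, and the special ".value" entry in names_to tells tidyr to turn the first part of each column name into its own output column.

```r
library(tidyr)

# Stand-in for the data after the two header rows were merged into
# "measure_year" column names (values are still text, as read from Excel)
wide <- data.frame(
  Name        = c("Alice", "Bob"),
  Height_2015 = c("12", "55"),
  Weight_2015 = c("34", "55"),
  Height_2016 = c("56", "55"),
  Weight_2016 = c("78", "55"),
  Height_2017 = c("90", "55"),
  Weight_2017 = c("12", "55"),
  stringsAsFactors = FALSE
)

# ".value" keeps Height / Weight as columns; the part after "_" becomes year
long <- pivot_longer(
  wide,
  cols = -Name,
  names_to = c(".value", "year"),
  names_sep = "_"
)

# Everything except Name is still character, so convert explicitly
num_cols <- c("year", "Height", "Weight")
long[num_cols] <- lapply(long[num_cols], as.numeric)

print(long)
```

The result should have one row per Name and year, with Height and Weight as numeric columns.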

Answer 2: (score: 1)


Assume your data is stored in dt. Here is a data.table approach (the syntax is nicer than the tidyverse):



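library(data.table)

# melt() stacks the measure columns one after another, so each year
# (two measure columns) covers 2 * nrow(dt) consecutive rows
dt2 <- data.table(melt(dt, id.vars = "Name", variable.name = "Measurement", value.name = "Value"),
                  year = rep(2015:2017, each = 2 * nrow(dt)))

> dt2
     Name Measurement Value year
 1: Alice      Height    12 2015
 2:   Bob      Height    55 2015
 3: Alice      Weight    34 2015
 4:   Bob      Weight    55 2015
 5: Alice      Height    56 2016
 6:   Bob      Height    55 2016
 7: Alice      Weight    78 2016
 8:   Bob      Weight    55 2016
 9: Alice      Height    90 2017
10:   Bob      Height    55 2017
11: Alice      Weight    12 2017
12:   Bob      Weight    55 2017

You will notice that I have the Height and Weight measurements in the same column. I recommend doing this rather than adding a separate column for each variable, because it is compatible with group by syntax.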

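What is odd about your data is that you have two rows as headers. This means you will have to adapt the year = ... argument in my answer to your data.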

In general, to create your year column, you need:
