Question

I have an Excel worksheet that looks like this:

             2015    2015    2016    2016    2017    2017
Name         Height  Weight  Height  Weight  Height  Weight  
Alice        12      34      56      78      90      12      
Bob          55      55      55      55      55      55     
...

My goal is to produce a tidy data frame such as:

Name    Year    Height    Weight
Alice   2015    12        34
Alice   2016    56        78
Alice   2017    90        12
Bob     2015    55        55
Bob     2016    55        55
Bob     2017    55        55
...

If the year row did not exist, I can see how to create the data frame with read_excel and then use gather from the tidyverse, but I do not know how to do this with two header rows. The main problem I have is that a column can apparently only have one name, yet it seems I want, at least temporarily, two names for each column. What is the best way to do this?

Answer 1

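This is a fairly common problem (people really do use Excel workbooks like this), but it involves several steps that need to be handled in R. Here is one approach, assuming your data frame is called dat. The code also assumes that the year row was read in as the column names, so the Height/Weight row is the first data row: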

library(dplyr)
library(tidyr)
library(magrittr) # for the assignment pipe %<>%

# Start by renaming your columns to include both the year and variable
# The use of '-' to separate the parts is for convenience in the regex below
names(dat)[2:ncol(dat)] <- paste(dat[1, 2:ncol(dat)],
                                 names(dat)[2:ncol(dat)],
                                 sep = "-")
names(dat)[1] <- "Name"
names(dat) <- sub("__\\d+", "", names(dat))

# Drop the now useless first row
dat <- dat[2:nrow(dat), ]

# Transform the data
dat %<>%
  gather(key = var, value = val, -Name) %>%
  mutate(Year = sub("^.*?-", "", var),
         var = sub("-\\d+$", "", var),
         val = as.numeric(val)) %>%  # the text header row made every column character
  spread(key = var, value = val)

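The pipeline that transforms the data works as follows. First, your intuition was right: we need gather to go from wide to long. Second, we create the Year variable and strip the digits from the temporary var variable. Finally, we use spread to split Height and Weight back into separate columns. Because the second header row in the original data was text, every value was read as character, so the values are also converted to numeric along the way.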
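As a possible alternative to the gather/mutate/spread chain, newer versions of tidyr (1.0.0 and later) offer pivot_longer, whose ".value" placeholder can split the combined names and keep Height and Weight as separate columns in one step. This is a minimal sketch rather than the answer's own method; the data frame below is hypothetical and assumes the columns have already been renamed to the "<measure>-<year>" pattern.

```r
library(tidyr)

# Hypothetical data: names already combined as "<measure>-<year>"
dat <- data.frame(
  Name = c("Alice", "Bob"),
  `Height-2015` = c(12, 55), `Weight-2015` = c(34, 55),
  `Height-2016` = c(56, 55), `Weight-2016` = c(78, 55),
  `Height-2017` = c(90, 55), `Weight-2017` = c(12, 55),
  check.names = FALSE  # keep the "-" in the column names
)

# ".value" tells pivot_longer that the first name part (Height/Weight)
# should become output columns; the second part goes into Year.
tidy <- pivot_longer(
  dat,
  cols = -Name,
  names_to = c(".value", "Year"),
  names_sep = "-"
)

print(tidy)
```

Year comes back as character here; as.integer(tidy$Year) converts it if a numeric year is needed.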

Answer 2

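This is tricky, but it comes up often when pulling data out of Excel files. I pasted your data into an xlsx file and read it with readxl::read_excel, but for reproducibility I have also pasted the dput output here. I set col_names = F so that the columns only get dummy names, which leaves each of the two header levels in a row of its own, like this: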

library(dplyr)
library(tidyr)

# df <- readxl::read_excel("multicols.xlsx", col_names = F)
df <- structure(list(...1 = c(NA, "Name", "Alice", "Bob"),
                     ...2 = c("2015", "Height", "12", "55"),
                     ...3 = c("2015", "Weight", "34", "55"),
                     ...4 = c("2016", "Height", "56", "55"),
                     ...5 = c("2016", "Weight", "78", "55"),
                     ...6 = c("2017", "Height", "90", "55"),
                     ...7 = c("2017", "Weight", "12", "55")),
                row.names = c(NA, -4L),
                class = c("tbl_df", "tbl", "data.frame"))
df
#> # A tibble: 4 x 7
#>   ...1  ...2   ...3   ...4   ...5   ...6   ...7  
#>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 <NA>  2015   2015   2016   2016   2017   2017  
#> 2 Name  Height Weight Height Weight Height Weight
#> 3 Alice 12     34     56     78     90     12    
#> 4 Bob   55     55     55     55     55     55

The years are in the first row and the measures in the second, so I pulled each of those out:

(yrs <- df[1,])
#> # A tibble: 1 x 7
#>   ...1  ...2  ...3  ...4  ...5  ...6  ...7 
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>  2015  2015  2016  2016  2017  2017
(cols <- df[2,])
#> # A tibble: 1 x 7
#>   ...1  ...2   ...3   ...4   ...5   ...6   ...7  
#>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 Name  Height Weight Height Weight Height Weight

Then I pasted the two together to get a decent vector of column names:

clean_names <- stringr::str_remove(paste(cols, yrs, sep = "_"), "_NA")
clean_names
#> [1] "Name"        "Height_2015" "Weight_2015" "Height_2016" "Weight_2016"
#> [6] "Height_2017" "Weight_2017"
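The pasting step could also be wrapped in a small base R helper that checks for a missing upper header directly instead of stripping "_NA" from the pasted string afterwards. This is only a sketch: combine_headers is a hypothetical name, and the two vectors below stand in for the yrs and cols rows.

```r
# Hypothetical helper: combine two header rows into "<measure>_<year>" names.
# Where the upper header is NA (e.g. above "Name"), the lower header is kept as is.
combine_headers <- function(top, bottom, sep = "_") {
  top <- as.character(unlist(top))       # also accepts a one-row data frame
  bottom <- as.character(unlist(bottom))
  ifelse(is.na(top), bottom, paste(bottom, top, sep = sep))
}

yrs  <- c(NA, "2015", "2015", "2016", "2016", "2017", "2017")
cols <- c("Name", "Height", "Weight", "Height", "Weight", "Height", "Weight")

clean_names <- combine_headers(yrs, cols)
print(clean_names)
```

Testing is.na() on the year row avoids accidentally altering a measure name that happens to contain "_NA".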

Now I can drop those two rows and set the proper names:

df %>%
  slice(-1:-2) %>%
  setNames(clean_names)
#> # A tibble: 2 x 7
#>   Name  Height_2015 Weight_2015 Height_2016 Weight_2016 Height_2017
#>   <chr> <chr>       <chr>       <chr>       <chr>       <chr>      
#> 1 Alice 12          34          56          78          90         
#> 2 Bob   55          55          55          55          55         
#> # … with 1 more variable: Weight_2017 <chr>

Finally, reshape the data to long form, separate the key into measure (Height or Weight) and year, then spread back to wide form.

df %>%
  slice(-1:-2) %>%
  setNames(clean_names) %>%
  gather(key, value, -Name) %>%
  separate(key, into = c("measure", "year")) %>%
  spread(key = measure, value)
#> # A tibble: 6 x 4
#>   Name  year  Height Weight
#>   <chr> <chr> <chr>  <chr> 
#> 1 Alice 2015  12     34    
#> 2 Alice 2016  56     78    
#> 3 Alice 2017  90     12    
#> 4 Bob   2015  55     55    
#> 5 Bob   2016  55     55    
#> 6 Bob   2017  55     55
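Every column in the result above is still <chr>, because the header rows forced the whole sheet to be read as text. One way to recover numeric types, sketched here with base R's type.convert on a small hypothetical data frame of the same shape, is:

```r
# Hypothetical tidy result in which everything is stored as character
tidy_chr <- data.frame(
  Name   = c("Alice", "Alice", "Bob"),
  year   = c("2015", "2016", "2015"),
  Height = c("12", "56", "55"),
  Weight = c("34", "78", "55"),
  stringsAsFactors = FALSE
)

# as.is = TRUE keeps real text (Name) as character instead of making factors;
# columns that look like whole numbers become integer.
tidy_num <- type.convert(tidy_chr, as.is = TRUE)

str(tidy_num)
```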

Answer 3

What you want to do is convert your data from wide format to long format, in other words reshape it.

Assuming your data is stored in dt, here is a data.table approach (the syntax is nicer than the tidyverse):

library(data.table)
# melt() stacks the measured columns one after another, so each year covers
# 2 measures x 2 names = 4 consecutive rows
dt2 <- data.table(melt(dt, id.vars = "Name", variable.name = "Measurement", value.name = "Value"), year = rep(2015:2017, each = 4))

Output:

> dt2
     Name Measurement Value year
 1: Alice      Height    12 2015
 2:   Bob      Height    55 2015
 3: Alice      Weight    34 2015
 4:   Bob      Weight    55 2015
 5: Alice      Height    56 2016
 6:   Bob      Height    55 2016
 7: Alice      Weight    78 2016
 8:   Bob      Weight    55 2016
 9: Alice      Height    90 2017
10:   Bob      Height    55 2017
11: Alice      Weight    12 2017
12:   Bob      Weight    55 2017

You will notice that I have the Height and Weight measurements in the same column. I recommend doing this rather than adding a separate column for each variable, because it is compatible with group by syntax.
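To illustrate what "compatible with group by syntax" can look like, here is a minimal sketch on a hypothetical long table built by hand in the same shape as dt2 (the names long, means and wide are illustrative only). It also shows dcast, in case one column per measurement is wanted after all.

```r
library(data.table)

# Hypothetical long table in the same shape as dt2
long <- data.table(
  Name        = rep(c("Alice", "Bob"), times = 6),
  Measurement = rep(rep(c("Height", "Weight"), each = 2), times = 3),
  Value       = c(12, 55, 34, 55, 56, 55, 78, 55, 90, 55, 12, 55),
  year        = rep(2015:2017, each = 4)
)

# Group by measurement and year: one mean per combination
means <- long[, .(mean_value = mean(Value)), by = .(Measurement, year)]
print(means)

# Back to one column per measurement, one row per Name and year
wide <- dcast(long, Name + year ~ Measurement, value.var = "Value")
print(wide)
```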

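What is unusual about your data is that it has two header rows. This means you will have to adjust the year = ... argument in my answer to fit your data. In general, the year column needs one entry per row of the melted table, in the same order as the stacked columns.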
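One way to build such a year column without hard-coding the pattern is to start from the year header itself. The sketch below is an assumption about how this could be done, not the answer's original code; header_years and n_rows are hypothetical names for the year header row (one entry per measured column) and the number of data rows.

```r
# Hypothetical inputs: the year header (one entry per measured column,
# in sheet order) and the number of data rows (one per Name).
header_years <- c(2015, 2015, 2016, 2016, 2017, 2017)
n_rows <- 2  # Alice and Bob

# melt() stacks the measured columns one after another, so each header
# entry has to be repeated once per data row.
year <- rep(header_years, each = n_rows)

print(year)
```

The resulting vector has length(header_years) * n_rows entries, which matches the number of rows melt produces for this layout.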

Reshaping multi-variable data using two rows as headers (wide to long)