Reformatting data with multiple headers into long format, where one header row becomes data in a new column

Time: 2020-05-13 22:11:03

Tags: r

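I often receive data formatted with multiple headers and merged cells (yes... Excel). Typically the data arrive with two or more merged cells that represent a sample location, sitting on top of columns holding many observations of the parameters measured at that location. I am reading the data with the read.xlsx function from the "openxlsx" package, as shown below (this will not run; it is for reference only):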

read.xlsx('Mussels.xlsx',
          detectDates = TRUE,
          sheet = 2,
          fillMergedCells = TRUE,
          startRow = 2)

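An example: I am currently working with invasive mussel survey data in which each of 14 sites has 25 lengths for each of two species. I have simplified it here: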

lendat <- data.frame(site.a = c("species.1",1,1,1,1),
                     site.a = c("species.2",2,2,2,2), 
                     site.b = c("species.1",3,3,3,3),
                     site.b = c("species.2",4,4,4,4),
                     check.names = FALSE)

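I would like to write some code that reformats these data into long format, where the column names become values under a new column named "site", and the first row of data becomes the remaining column names, each holding the lengths for one species, as follows: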

data_form <- data.frame(site = c(rep("site.a", 4), rep("site.b",4)),
                        species.1 = c(1,1,1,1,3,3,3,3),
                        species.2 = c(2,2,2,2,4,4,4,4))

Update based on @Ronak Shah's answer

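Using the code from the accepted answer below on my real data produces a tibble with no data. I found that the filter step breaks once decimal values are introduced (the real data contain decimals). I thought it was a data-format problem (the example data are all factors), but even so, the decimal values are turned into NA. See the example: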

lendat <- data.frame(site.a = c("species.1", 1.1,2.2,3,4),
                     site.a = c("species.2",5,6,7,8), 
                     site.b = c("species.1", 9,10,11,12),
                     site.b = c("species.2",13,14,15,16),
                     check.names = FALSE)
str(lendat)
'data.frame':   5 obs. of  4 variables:
 $ site.a: Factor w/ 5 levels "1.1","2.2","3",..: 5 1 2 3 4
 $ site.a: Factor w/ 5 levels "5","6","7","8",..: 5 1 2 3 4
 $ site.b: Factor w/ 5 levels "10","11","12",..: 5 4 1 2 3
 $ site.b: Factor w/ 5 levels "13","14","15",..: 5 1 2 3 4
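
As a side note on the factor columns shown by str(): the sketch below illustrates why calling as.numeric() directly on such a column does not return the lengths. The vector `f` is a hypothetical stand-in for the first column of the example, not part of the original data.

```r
# A factor column like the first one in the example (header label + lengths)
f <- factor(c("species.1", "1.1", "2.2", "3", "4"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(f)
codes
# [1] 5 1 2 3 4

# Going through character first recovers the numbers; the label is dropped here
lengths_num <- as.numeric(as.character(f[-1]))
lengths_num
# [1] 1.1 2.2 3.0 4.0
```

The codes match the `5 1 2 3 4` shown in the str() output above, which is why a conversion through as.character() is the usual route for factor columns.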

I split the piped code up line by line:

#Get data in long format
pivot_longer(junk, cols = everything(), names_to = 'site') %>%
  #Create a new column with column names
  mutate(col = paste0('species', .copy)) %>%
  #Remove the values from the first row
  filter(!grepl('\\D', value)) %>%
  #Remove .copy column which was created
  select(-.copy) %>%
  #Group by the new column
  group_by(col) %>%
  #Add a row index
  mutate(row = row_number()) %>%
  #Get data in wide format
  pivot_wider(names_from = col, values_from = value) %>%
  #Remove row index
  select(-row) %>%
  #Arrange data according to site information
  arrange(site)

x <- pivot_longer(junk, cols = everything(), names_to = 'site')
x
x <- mutate(x, col = paste0('species', .copy))
x
x <- filter(x, !grepl('\\D', value))
x
x <- select(.data = x, -.copy)
x
x <- group_by(x, col)
x
x <- mutate(x, row = row_number())
x
x <- pivot_wider(x, names_from = col, values_from = value)
x
x <- select(x, -row)
x
x <- arrange(x, site)
x

The code runs, but leaves NA values in the final tibble.
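
One likely cause can be sketched in isolation. The pattern `\\D` matches any non-digit character, and that includes the decimal point, so `filter(!grepl('\\D', value))` discards "1.1" and "2.2" along with the species labels. The answer below filters on `[A-Za-z]` instead. The vector `value` here is a hypothetical stand-in for the column that pivot_longer() produces.

```r
# Stand-in for the `value` column: one header label plus the lengths as text
value <- c("species.1", "1.1", "2.2", "3", "4")

# "\\D" matches any non-digit, so strings with a decimal point are rejected too
keep_nondigit <- !grepl("\\D", value)
keep_nondigit
# [1] FALSE FALSE FALSE  TRUE  TRUE

# "[A-Za-z]" matches only letters, so decimal numbers survive the filter
keep_letters <- !grepl("[A-Za-z]", value)
keep_letters
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
```

Once rows are dropped unevenly across columns, the row index built in the later steps no longer lines up between species, which would be consistent with the NA values described above.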

1 answer:

Answer 0 (score: 0)

Using dplyr and tidyr:

library(dplyr)
library(tidyr)

#Get data in long format
pivot_longer(lendat, cols = everything(), names_to = 'site') %>%
    #Create a new column with column names
    mutate(col = paste0('species', .copy)) %>%
    #Remove the values from the first row
    filter(!grepl('[A-Za-z]', value)) %>%
    #Remove .copy column which was created
    select(-.copy) %>%
    #Group by the new column
    group_by(col) %>%
    #Add a row index
    mutate(row = row_number()) %>%
    #Get data in wide format
    pivot_wider(names_from = col, values_from = value) %>%
    #Remove row index
    select(-row) %>%
    #Arrange data according to site information
    arrange(site)


#   site   species1 species2
#  <chr>  <chr>    <chr>   
#1 site.a 1.1      5       
#2 site.a 2.2      6       
#3 site.a 3        7       
#4 site.a 4        8       
#5 site.b 9        13      
#6 site.b 10       14      
#7 site.b 11       15      
#8 site.b 12       16
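
The pipeline above relies on the `.copy` column that pivot_longer() created for duplicated column names when the answer was written; that behaviour may not hold for every tidyr version. As an alternative that does not depend on it, here is a minimal base R sketch (not the answerer's method). It assumes every site has the same species columns and the same number of observations; the helper `reshape_site` and the names `sites`, `species`, and `long` are illustrative only.

```r
# Example data from the question, with duplicated column names kept as-is
lendat <- data.frame(site.a = c("species.1", 1.1, 2.2, 3, 4),
                     site.a = c("species.2", 5, 6, 7, 8),
                     site.b = c("species.1", 9, 10, 11, 12),
                     site.b = c("species.2", 13, 14, 15, 16),
                     check.names = FALSE, stringsAsFactors = FALSE)

# First header level: the site that each column belongs to
sites <- names(lendat)

# Second header level: the species label stored in the first data row
species <- vapply(lendat, function(col) as.character(col[1]),
                  character(1), USE.NAMES = FALSE)

# Build one block per site: numeric columns named after the species
reshape_site <- function(s) {
  idx <- which(sites == s)
  # as.character() first, so the conversion also works for factor columns
  vals <- lapply(idx, function(j) as.numeric(as.character(lendat[[j]][-1])))
  names(vals) <- species[idx]
  data.frame(site = s, vals, stringsAsFactors = FALSE)
}

# Stack the per-site blocks into the long layout
long <- do.call(rbind, lapply(unique(sites), reshape_site))
rownames(long) <- NULL
long
```

The result has the same layout as `data_form` in the question, with numeric rather than character columns for the lengths.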