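I often receive data formatted with multiple header rows and merged cells (yes, Excel). Typically the data arrive as 2+ merged cells that name a sample location, sitting on top of columns of observations, one column per parameter measured at that location. I read the data with the read.xlsx function from the "openxlsx" package, as shown below (not runnable, for reference only):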
read.xlsx('Mussels.xlsx',
          detectDates = TRUE,
          sheet = 2,
          fillMergedCells = TRUE,
          startRow = 2)
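An example: I am currently working with invasive mussel survey data in which each of 14 sites has 25 lengths for each of two species. Here is a simplified version: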
lendat <- data.frame(site.a = c("species.1",1,1,1,1),
                     site.a = c("species.2",2,2,2,2),
                     site.b = c("species.1",3,3,3,3),
                     site.b = c("species.2",4,4,4,4),
                     check.names = FALSE)
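I would like to write some code that reshapes these data into long format, where the column names become the values of a new column named "site" and the first row of the data becomes the remaining column names, each holding the lengths for one species, like this: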
data_form <- data.frame(site = c(rep("site.a", 4), rep("site.b", 4)),
                        species.1 = c(1,1,1,1,3,3,3,3),
                        species.2 = c(2,2,2,2,4,4,4,4))
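Update based on @Ronak Shah's answer
Running the code from the accepted answer below on my actual data returns a tibble with no data. I found that the filter step breaks as soon as decimal values are introduced (my real data contain decimals). I first thought it was a data type problem (the example columns are all factors), but even allowing for that, the decimal values end up as NA. See this example: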
lendat <- data.frame(site.a = c("species.1", 1.1,2.2,3,4),
                     site.a = c("species.2",5,6,7,8),
                     site.b = c("species.1", 9,10,11,12),
                     site.b = c("species.2",13,14,15,16),
                     check.names = FALSE)
str(lendat)
'data.frame': 5 obs. of 4 variables:
$ site.a: Factor w/ 5 levels "1.1","2.2","3",..: 5 1 2 3 4
$ site.a: Factor w/ 5 levels "5","6","7","8",..: 5 1 2 3 4
$ site.b: Factor w/ 5 levels "10","11","12",..: 5 4 1 2 3
$ site.b: Factor w/ 5 levels "13","14","15",..: 5 1 2 3 4
I split the piped code (shown first) into one statement per line so I could inspect each step:
#Get data in long format
pivot_longer(lendat, cols = everything(), names_to = 'site') %>%
  #Create a new column with column names
  mutate(col = paste0('species', .copy)) %>%
  #Remove the values from the first row
  filter(!grepl('\\D', value)) %>%
  #Remove .copy column which was created
  select(-.copy) %>%
  #Group by the new column
  group_by(col) %>%
  #Add a row index
  mutate(row = row_number()) %>%
  #Get data in wide format
  pivot_wider(names_from = col, values_from = value) %>%
  #Remove row index
  select(-row) %>%
  #Arrange data according to site information
  arrange(site)
x <- pivot_longer(lendat, cols = everything(), names_to = 'site')
x
x <- mutate(x, col = paste0('species', .copy))
x
x <- filter(x, !grepl('\\D', value))
x
x <- select(.data = x, -.copy)
x
x <- group_by(x, col)
x
x <- mutate(x, row = row_number())
x
x <- pivot_wider(x, names_from = col, values_from = value)
x
x <- select(x, -row)
x
x <- arrange(x, site)
x
The code runs, but leaves NAs in the final tibble.
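A quick check of the filter step, as a minimal base R sketch: the pattern `\\D` matches any non-digit character, and the decimal point is one, so a value such as "1.1" is dropped along with the species labels. A letters-only pattern keeps it. The vector `value` below is a made-up stand-in for the `value` column that pivot_longer produces, not the real data.

```r
# Hypothetical stand-in for the `value` column after pivot_longer()
value <- c("species.1", "1.1", "2.2", "3", "4")

# '\\D' flags every string that contains a non-digit, including the "." in decimals
drop_nondigit <- grepl("\\D", value)

# '[A-Za-z]' flags only strings that contain a letter
drop_letters <- grepl("[A-Za-z]", value)

kept_nondigit <- value[!drop_nondigit]  # decimals are lost here
kept_letters  <- value[!drop_letters]   # decimals survive here

print(kept_nondigit)
print(kept_letters)
```

With rows removed unevenly, the species columns no longer have the same number of values per site, which would explain the NAs after pivot_wider. One caveat on the letters-only pattern: a value stored in scientific notation (for example "1e-05") contains a letter and would still be dropped.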
Answer 0 (score: 0)
Using dplyr and tidyr:
library(dplyr)
library(tidyr)
#Get data in long format
pivot_longer(lendat, cols = everything(), names_to = 'site') %>%
  #Create a new column with column names
  mutate(col = paste0('species', .copy)) %>%
  #Remove the values from the first row
  filter(!grepl('[A-Za-z]', value)) %>%
  #Remove .copy column which was created
  select(-.copy) %>%
  #Group by the new column
  group_by(col) %>%
  #Add a row index
  mutate(row = row_number()) %>%
  #Get data in wide format
  pivot_wider(names_from = col, values_from = value) %>%
  #Remove row index
  select(-row) %>%
  #Arrange data according to site information
  arrange(site)
#  site   species1 species2
#  <chr>  <chr>    <chr>
#1 site.a 1.1      5
#2 site.a 2.2      6
#3 site.a 3        7
#4 site.a 4        8
#5 site.b 9        13
#6 site.b 10       14
#7 site.b 11       15
#8 site.b 12       16
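Two things to keep in mind with the output above: the species columns are still character, and the pipeline depends on the `.copy` column, which, as far as I know, only some tidyr versions create for duplicated column names. As a hedged alternative (not part of the original answer), here is a minimal base R sketch that works by column position and converts the values to numeric. The names `sites`, `species`, `values` and `out` are made up for illustration, and the sketch assumes every site has the same set of species columns.

```r
# Example data from the question (duplicated column names, labels in row 1)
lendat <- data.frame(site.a = c("species.1", 1.1, 2.2, 3, 4),
                     site.a = c("species.2", 5, 6, 7, 8),
                     site.b = c("species.1", 9, 10, 11, 12),
                     site.b = c("species.2", 13, 14, 15, 16),
                     check.names = FALSE)

# Site labels come from the (duplicated) column names
sites <- names(lendat)

# Species labels come from the first row; as.character() also handles factors
species <- vapply(lendat, function(col) as.character(col[1]),
                  character(1), USE.NAMES = FALSE)

# Remaining rows are the measurements, converted to numeric
values <- lapply(lendat, function(col) as.numeric(as.character(col[-1])))

# Build one block per site, naming its columns by species, then stack the blocks
out <- do.call(rbind, lapply(unique(sites), function(s) {
  idx <- which(sites == s)
  cols <- values[idx]
  names(cols) <- species[idx]
  data.frame(site = s, cols, check.names = FALSE, stringsAsFactors = FALSE)
}))

print(out)
str(out)
```

Because it indexes columns by position, this does not need unique column names. The species columns come out as `species.1` and `species.2`, matching the layout asked for in the question.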