I have a directory of CSV files that all share a common column (Class) followed by an integer value, although the files differ in length. An example ([1:5, ]):
Class Abundance_inds
1 Chaetognath 2
2 Copepod_Calanoid_Acartia_spp 9
3 Copepod_Calanoid_Centropages_spp 4
4 Copepod_Calanoid_Temora_spp 1
5 Copepod_Calanoid_Unknown 5
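They are exported from another R script, so the first column has to be dropped before merging. I can merge them successfully with the following: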
test <- read.csv(file = csvs[1])[ ,2:3]
test2 <- read.csv(file = csvs[2])[ ,2:3]
Then:
library(tidyverse)
mergedcsvs <- list(test, test2) %>% reduce(full_join, by = "Class")
For however many files there are, this gives the expected result ([1:4, ]):
Class Abundance_inds.x Abundance_inds.y
1 Chaetognath 2 4
2 Copepod_Calanoid_Acartia_spp 9 11
3 Copepod_Calanoid_Centropages_spp 4 8
4 Copepod_Calanoid_Temora_spp 1 NA
I would also like to use the basename of each file as its column header, which I know I can extract like this:
basename1 <- basename(csvs[1])
basename2 <- basename(csvs[2])
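I know I could build a list of basenames and then use those as the column headers, but creating a data frame for every individual CSV (there are many) and then doing this by hand seems impractical.
Because the CSVs are exported from another R script, they also carry an unwanted first column that needs to be removed.
Surely there is a better way! Any help would be great.
(I have had a mess around with this, but could not get it to work for me.)
Many thanks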
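Answer 0: (score: 3)
Using the test input shown in the note at the end, first read the files named in the character vector filenames, then merge them, and finally set the column names. The tools package ships with R, so you do not need to install it.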
library(tools)
LL <- Map(read.csv, filenames, as.is = TRUE)
r <- Reduce(function(...) merge(..., all = TRUE, by = "Class"), LL)
names(r)[-1] <- basename(file_path_sans_ext(filenames))
giving:
Class DF1 DF2 DF3
1 Chaetognath 2 NA 2
2 Copepod_Calanoid_Acartia_spp 9 9 9
3 Copepod_Calanoid_Centropages_spp 4 4 NA
4 Copepod_Calanoid_Temora_spp 1 1 1
5 Copepod_Calanoid_Unknown NA 5 5
Depending on the output you want, you may need all = FALSE in place of the all argument shown.
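As a small illustration of that choice, here is a sketch with two made-up data frames (a and b are hypothetical, and their values are invented rather than taken from the question). With all = TRUE every Class found in either input is kept; with all = FALSE only the classes present in both survive.

```r
# Two toy inputs; names and values are assumptions for illustration only
a <- data.frame(Class = c("Chaetognath", "Copepod_Calanoid_Temora_spp"), n = c(2, 1))
b <- data.frame(Class = c("Chaetognath", "Copepod_Calanoid_Unknown"), n = c(4, 5))

outer <- merge(a, b, by = "Class", all = TRUE)   # union of classes, gaps become NA
inner <- merge(a, b, by = "Class", all = FALSE)  # only classes present in both inputs

nrow(outer)  # 3
nrow(inner)  # 1
```

Which one is appropriate depends on whether a class missing from a file should appear as NA or be dropped altogether.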
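Note: this time I have provided the test data for you below, but really the question should have included it, along with the output you expect.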
Lines <- " Class Abundance_inds
1 Chaetognath 2
2 Copepod_Calanoid_Acartia_spp 9
3 Copepod_Calanoid_Centropages_spp 4
4 Copepod_Calanoid_Temora_spp 1
5 Copepod_Calanoid_Unknown 5"
DF <- read.table(text = Lines, as.is = TRUE)
L <- list(DF1 = DF[1:4, ], DF2 = DF[2:5, ], DF3 = DF[-3, ])
filenames <- paste0(names(L), ".csv")
for(i in seq_along(filenames)) write.csv(L[[i]], filenames[i], row.names = FALSE)
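The test files above are written with row.names = FALSE, so they have no extra leading column. If the real files do carry one, as the question describes, something along these lines might work. This is only a sketch: read_one is a hypothetical helper, and the demo files are written to a temporary directory purely so the example is self-contained.

```r
library(tools)  # file_path_sans_ext(); ships with R

# Demo input (assumed data): two small files WITH a row-name column
d <- tempfile("csvdemo")
dir.create(d)
DF <- data.frame(Class = c("Chaetognath", "Copepod_Calanoid_Acartia_spp",
                           "Copepod_Calanoid_Temora_spp"),
                 Abundance_inds = c(2, 9, 1))
write.csv(DF[1:2, ], file.path(d, "DF1.csv"))  # row names become column 1
write.csv(DF[2:3, ], file.path(d, "DF2.csv"))

# Hypothetical helper: read one file, drop column 1,
# and name the value column after the file
read_one <- function(path) {
  x <- read.csv(path, as.is = TRUE)[, -1]
  names(x)[2] <- file_path_sans_ext(basename(path))
  x
}

filenames <- list.files(d, pattern = "\\.csv$", full.names = TRUE)
r <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "Class"),
            lapply(filenames, read_one))
names(r)  # "Class" "DF1" "DF2"
```

Renaming the value column before merging means each input already has a unique column name, so merge never has to invent .x and .y suffixes, however many files there are.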
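Answer 1: (score: 2)
One possibility is to read the data.frames into a nested tibble. You first define a function that describes how to read and transform a single data frame. In your case it looks like this: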
library(tidyverse)
read_onecsv <- function(csvname, columnname) {
  read.csv(file = csvname) %>% as_tibble() %>%
    select(2:3) %>% mutate(type = columnname)
}
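This function reads one CSV file, converts it to a tibble, selects columns 2 and 3, and then adds a dummy column (named type) holding what will later become the column name.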
Next you create a tibble containing all the csvnames and all the columnnames, and then run the following:
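tibble(csvnames = c("csv1.csv", "csv2.csv"), columnnames = c("col1", "col2")) %>%
  mutate(data = map2(csvnames, columnnames, read_onecsv)) %>%
  # keep only the nested data; otherwise csvnames and columnnames act as
  # extra id columns in spread() and rows for the same Class are not combined
  select(data) %>%
  unnest(data) %>%
  spread(type, Abundance_inds)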
Answer 2: (score: 2)
Using the fast fread from library(data.table):
library(tidyverse)
library(data.table)
library(tools)
write.csv(data.frame(stringsAsFactors=FALSE,
Class = c("Chaetognath", "Copepod_Calanoid_Acartia_spp",
"Copepod_Calanoid_Centropages_spp",
"Copepod_Calanoid_Temora_spp"),
Abundance_inds = c(2, 9, 4, 1)
), file = "x.csv")
write.csv(data.frame(stringsAsFactors=FALSE,
Class = c("Chaetognath", "Copepod_Calanoid_Acartia_spp",
"Copepod_Calanoid_Centropages_spp"),
Whatever = c(4, 11, 8)
), file = "y.csv")
csvPaths <- list.files(".", "\\.csv$", full.names = TRUE)
csvList <- list()
for(csvPath in csvPaths){
csvList[[csvPath]] <- fread(csvPath, col.names = c("Class", basename(file_path_sans_ext(csvPath))), drop = 1)
}
mergedcsvs <- csvList %>% reduce(full_join, by = "Class")
# Class x y
# 1 Chaetognath 2 4
# 2 Copepod_Calanoid_Acartia_spp 9 11
# 3 Copepod_Calanoid_Centropages_spp 4 8
# 4 Copepod_Calanoid_Temora_spp 1 NA
Edit: here is a data.table-only way (avoiding library(tidyverse)):
csvPaths <- list.files(".", "\\.csv$", full.names = TRUE)
csvList <- list()
for(csvPath in csvPaths){
csvList[[csvPath]] <- fread(csvPath, drop = 1, col.names = c("class", "vars"))[, id := basename(file_path_sans_ext(csvPath))]
}
DT <- rbindlist(csvList, use.names = FALSE)
mergedDT <- dcast(DT, class ~ id, value.var = "vars")
mergedDT
Answer 3: (score: 0)
A possible solution using list.files and lapply.
library(readr)
library(dplyr)  # %>%, select(), full_join()
library(purrr)  # reduce()
## read all names ending in .csv from your working directory and save as a variable
fileNames <- list.files(pattern = '\\.csv$')
## read all files, drop the first column, merge and save as a tibble
fileList <- lapply(fileNames, function(f) read_csv(f) %>% select(-1)) %>%
  reduce(full_join, by = 'class')
## rename columns
names(fileList) <- c(names(fileList)[1], sub('\\.csv$', '', fileNames))
## output
# A tibble: 4 x 3
class test1 test10
<chr> <dbl> <dbl>
1 banana 1 1
2 apples 1 1
3 orange 10 NA
4 ginger NA 5
For testing purposes I created two .csv files (test1.csv and test10.csv).
File test1.csv
number, class,value
1,banana,1
2,apples,1
3,orange,10
File test10.csv
number, class,value
1,banana,1
2,apples,1
3,ginger,5