Question

I have loaded 20 csv files using:

tbl = list.files(pattern="*.csv")
for (i in 1:length(tbl)) assign(tbl[i], read.csv(tbl[i]))

or

list_of_data = lapply(tbl, read.csv)

This is how it looks:

> head(tbl)
[1] "F1.csv"          "F10_noS3.csv"    "F11.csv"         "F12.csv"         "F12_noS7_S8.csv"
[6] "F13.csv"

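I have to combine all of those files into one. Let's call it a master file, but first let's try to make one table with all of the names. Each of those csv files has a column called "Accession". I would like to make a table of all the "names" from all of those csv files. Of course, many of the accessions can be repeated across different csv files. I would like to keep all of the data corresponding to each accession.

Some problems:

Some of those "names" are the same and I don't want to duplicate them
Some of those "names" are almost the same. The difference is that the name is followed by a dot and a number.
The number of columns can differ between those csv files.

Here is a screenshot showing the data: http://imageshack.com/a/img811/7103/29hg.jpg

Let me show you how it looks: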

AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--

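<-- = the same sample, different name. These should be treated as one, so just ignore the dot and the number after it.

Is it possible to do this?

I couldn't post a dput(head) because the data set is too big even for that.

I tried to use code like this: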

all_data = do.call(rbind, list_of_data)
Error in rbind(deparse.level, ...) : 
The number of columns is not correct.


all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))

I have been trying to do this for almost two weeks and I have not managed it. So please help me.

Answer 1

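Your question seems to contain multiple sub-questions. I encourage you to separate them.

The first thing you apparently need is to combine data frames with different columns. You can use rbind.fill from the plyr package: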

library(plyr)
all_data = do.call(rbind.fill, list_of_data)
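For readers who want to see what that fill step amounts to, here is a minimal base R sketch of the same idea, not plyr's actual implementation. The helper name rbind_fill_base and the toy data frames f1 and f2 (with made-up columns S1 and S2) are assumptions for illustration only.

```r
# Hypothetical helper: row-bind data frames whose columns differ,
# padding the columns a data frame lacks with NA.
rbind_fill_base <- function(dfs) {
  # union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA            # add each absent column as NA
    }
    df[all_cols]                 # same column order in every data frame
  })
  do.call(rbind, padded)
}

# Toy stand-ins for two of the csv files (assumed column names)
f1 <- data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"),
                 S1 = c(1.2, 3.4))
f2 <- data.frame(Accession = "AT3G26450.2", S1 = 0.5, S2 = 7.1)

all_data <- rbind_fill_base(list(f1, f2))
print(all_data)
```

With the real data, list_of_data from the question would take the place of list(f1, f2).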

Answer 2

Here's an example using some tidyverse functions and a custom function that can combine multiple csv files with missing columns into one data frame:

library(tidyverse)

# specify the target directory
dir_path <- '~/test_dir/' 

# specify the naming format of the files.
# in this case csv files that begin with 'test' followed by a single digit, but it could be as simple as 'csv'
re_file <- '^test[0-9]\\.csv'

# create sample data with some missing columns 
df_mtcars <- mtcars %>% rownames_to_column('car_name')
write.csv(df_mtcars %>% select(-am), paste0(dir_path, 'test1.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-wt, -gear), paste0(dir_path, 'test2.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-cyl), paste0(dir_path, 'test3.csv'), row.names = FALSE)

# custom function that takes the target directory and a single file name as arguments
read_dir <- function(dir_path, file_name){
  x <- read_csv(paste0(dir_path, file_name)) %>% 
    mutate(file_name = file_name) %>% # add the file name as a column              
    select(file_name, everything())   # reorder the columns so file name is first
  return(x)
}

# read the files from the target directory that match the naming format and combine into one data frame
df_panel <-
  list.files(dir_path, pattern = re_file) %>% 
  map_df(~ read_dir(dir_path, .))

# files with missing columns are filled with NAs.
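Once the files are combined, the near-duplicate accessions from the question (same name, different ".1" / ".2" suffix) still need handling. The following is a rough base R sketch under the assumption that the suffix is always a dot followed by digits at the end of the name; the toy data frame and its S1 column are made up, and CleanedAccession is the column name used in the question.

```r
# Toy combined table (assumed values; stands in for the merged csv data)
all_data <- data.frame(
  Accession = c("AT3G26450.1", "AT5G44520.2", "AT3G26450.2"),
  S1 = c(1.2, 3.4, 0.5),
  stringsAsFactors = FALSE
)

# Strip a trailing dot-plus-digits suffix, e.g. "AT3G26450.1" -> "AT3G26450"
all_data$CleanedAccession <- sub("\\.[0-9]+$", "", all_data$Accession)

# Keep only the first row seen for each cleaned accession
unique_accessions <- all_data[!duplicated(all_data$CleanedAccession), ]
print(unique_accessions)
```

Note that duplicated() keeps the first occurrence and drops the rest, so values from later rows are discarded. If all of the data per accession has to be kept, CleanedAccession can be used as a grouping key instead of a filter.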

Combining several csv files with different numbers of columns into one
