R: Merging multiple lists into one data frame

Time: 2017-07-18 14:25:39

Tags: r database list dataframe merge

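Apologies in advance if this has already been asked and answered elsewhere, but I have read a lot of threads and nothing has worked so far.

I am working on merging a large number of data files into a single database. Every data file shares the same first four (sometimes five) column names, which are words/characters; after those, the column names are numeric labels that differ between files (they are the same within one set of files, but differ between sets). As an example, suppose file 1 has columns a, b, c, d, 1.1, 1.2, 2.3, file 2 has columns a, b, c, d, 1.3, 1.4, 2.1, and file 3 has columns a, b, c, e, 3.2, 5.1.

Each file then contains a different number of observations (the first column is always the report date). Some observations are numeric, others are character. I would like to read in all the files at once and combine them into one data frame in which
(1) columns shared with other files are merged into one,
(2) columns that differ are added automatically, and
(3) observations whose values in the first four/five columns (i.e. the report date and similar specifications) are all identical are entered on the same row. For example, if files 1 and 2 are identical on columns a, b, c and d, but file 1 has observations in columns 1.1, 1.2, 2.3 and nothing elsewhere, while file 2 has observations in columns 1.3, 1.4, 2.1 and nothing elsewhere, I would like those observations to end up on the same row. (The best I have managed so far is a separate row for every row of the original files, which leaves my final result consisting mostly of NAs/empty cells and makes it neither compact nor usable.)

I have a large number of files, each of a different length, and I would like to read them in and merge them all at once using a loop. What I have managed so far is the following: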

# packages
library(data.table)
library(plyr)
library(reshape)
library(dplyr)

#make a list of files in a folder and label it "filenames"
filenames <- list.files("path", full.names = T)

#read each element in "filenames" into R and label the resulting list "csvs"
csvs <- lapply(filenames, read.csv)

#merge all elements of "csvs" into one data table
merged.sheet = Reduce(function(...) merge(..., all=T), csvs)

#export table as csv
write.csv(merged.sheet, "path")

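The result contains all the data I want, and each column is added only once, as I had hoped (although the column order is odd and I don't know how to sort it the way I want, and R, for some reason, prefixes every column name with an X). However, it is not compact at all, because it simply puts one row beneath the other, even where rows could be merged because the identifying values (date, category, etc.) are the same and the observations sit in different columns.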
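On the X prefix and the column order: by default read.csv repairs column names that are not syntactically valid R names, so a header such as 2.1 becomes X2.1. The following is a minimal sketch rather than a fix for the merging problem itself; the in-memory CSV text and the id_cols vector are made up for illustration.

```r
# Hypothetical two-line CSV standing in for one of the real files
csv_text <- "2.1,ReportDate,RL,2.2\n5,30/10/2016,Service 1,7\n"

# Default behaviour: syntactically invalid names are repaired, hence the X
names_default <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as it appears in the file
raw <- read.csv(text = csv_text, check.names = FALSE)
names_raw <- names(raw)

# Reorder so that the identifier columns (assumed names) come first
id_cols <- c("ReportDate", "RL")
reordered <- raw[, c(id_cols, setdiff(names(raw), id_cols))]
```

Columns named 2.1 then have to be quoted with backticks when referenced directly, so keeping the X prefix may still be the more convenient choice.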

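I have played around a lot and searched extensively, but nothing has worked so far. For example, I tried setkey before merging, but that gave an error because what I had read in was a list, not a data frame; I tried various melt functions, but they either returned an error (when I specified the ID variables, R told me it could not find them in the data, even though I had copied them exactly) or didn't identify my numbered columns as ID variables and omitted quite a lot of data (when I did not specify the ID variables). I also tried passing arguments to the merge function, e.g. by = "a", but that did not give me the result I wanted. I tried by = "a", by.x = "b" and by.y = "c", which returned an error message (something about the length of an argument being incorrect). Passing several arguments to by also returned an error (because only one unique column name is allowed).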

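I am new to R and can't think of anything else. Any help would be much appreciated!

EDIT: I have created some sample data to illustrate what my dataset looks like. The sample data consists of 5 files.

EDIT2: Here is the structure of the 5 sample files: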

dput(File1)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"
), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"), 
    Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA, 
    34343L, NA, NA, 360000000L, 1000000000L, 13500000L), X2.2 = c(NA, 
    NA, NA, NA, NA, NA, 520000000L, 270000000L, 178L), X3.1 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(540000, 3.02e+08, 
    150, NA, NA, NA, 11111111, 2323232, 102)), .Names = c("ReportDate", 
"RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, 
-9L))
> dput(File2)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"
), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"), 
    Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA, 
    76000L, NA, NA, 13000000L, 13000000L, 24000L), X2.2 = c(NA, 
    NA, NA, NA, NA, NA, 90909090L, 325500L, 198000L), X3.1 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(1.6e+10, 2434340000, 
    2.8e+10, NA, NA, NA, 500, 21000, 6.5e+10)), .Names = c("ReportDate", 
"RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, 
-9L))
> dput(File3)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), 
    RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), 
    RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e"), class = "factor"), 
    Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
    2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"), 
    Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz"), class = "factor"), 
    X3.7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, 7000000L, 650404040L), X3.8 = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, 
    NA, NA, NA, NA, NA, 123456, 1.7e+11, NA, NA, 50004444, 50004444, 
    1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA, 
    NA, 1.7e+10, 2.8005e+10, NA, NA, 3e+09, 3e+09, 4e+09, 4e+09, 
    3.5e+09, 3.5e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.3434e+10, 
    4.3434e+10, NA, NA, 3870015600, 3762897490, 54545454, 7006666, 
    9.3e+11, 7675030303)), .Names = c("ReportDate", "RL", "RLI", 
"Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", "X3.11", 
"X3.12"), class = "data.frame", row.names = c(NA, -16L))
> dput(File4)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), 
    RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), 
    RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 
    4L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e", "f"), class = "factor"), 
    Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
    2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"), 
    Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
    3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz", "yx"
    ), class = "factor"), X3.7 = c(NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, 1900000L, 630404040L), X3.8 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
    ), X3.9 = c(NA, NA, NA, NA, NA, NA, 503456, 1.27e+11, NA, 
    NA, 51004444, 51004444, 1200000, 1200000, NA, NA), X3.11 = c(NA, 
    NA, NA, NA, NA, NA, 1.6e+10, 1.3005e+10, NA, NA, 3e+09, 4.3e+09, 
    4e+09, 4e+09, 2.8e+09, 2.8e+09), X3.12 = c(NA, NA, NA, NA, 
    NA, NA, 4.4434e+10, 4.4434e+10, NA, NA, 4070015600, 3762897490, 
    54545454, 8006666, 9.3e+10, 7585030303)), .Names = c("ReportDate", 
"RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", 
"X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L
))
> dput(File5)
structure(list(ReportDate = structure(c(1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), 
    RL = structure(c(1L, 1L, 1L), .Label = "Service2", class = "factor"), 
    RLI = structure(c(1L, 1L, 1L), .Label = "cd", class = "factor"), 
    Identifier1 = structure(c(2L, 1L, 2L), .Label = c("h", "j"
    ), class = "factor"), Identifier2 = structure(c(1L, 2L, 2L
    ), .Label = c("xz", "yx"), class = "factor"), X5.1 = c(656565L, 
    2340808L, NA), X5.2 = c(104L, NA, NA), X5.4 = c(64343L, NA, 
    NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", 
"Identifier2", "X5.1", "X5.2", "X5.4"), class = "data.frame", row.names = c(NA, 
-3L))

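The actual dataset looks more or less like this, except that the variables have different names and observations, there are hundreds of files, and there are many more numbered variables (the X in front of the variable names was added by R; it is not in the original csv files). (Also, in the sample data I have only included numbers and empty cells, but some observations in the original dataset are characters.)

I would like to merge these files into one so that the result is as compact as possible. For example, observation 1 in file 5 should be on the same row as observation 1 in file 4, because the date, RL, RLI and identifiers 1 and 2 are the same for both and the observations are in different columns. But if the date or one of the other identifiers differs, they should be on separate rows.

My three main attempts so far are as follows: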

# packages
library(data.table)
library(openxlsx)
library(plyr)
library(reshape)
library(dplyr)
library(tidyverse)
library(purrr)

##attempt 1

#make a list of files in a folder and label it "allFiles"

pathName <- "path"
allFiles <- list.files(pathName, full.names = T) 
allFiles <- lapply(allFiles, read.csv)

#merge all elements of "allFiles" into one datatable
merged.sheet = Reduce(function(...) merge(..., all=T), allFiles)

This is the best approach so far. The structure of merged.sheet is

structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("30/10/2016", 
"01/12/2016"), class = "factor"), RL = structure(c(1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Service 1", 
"Service 2", "Service2"), class = "factor"), RLI = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 
4L), .Label = c("ab", "cd", "f", "e"), class = "factor"), Identifier2 = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 
1L), .Label = c("xy", "xz", "yx"), class = "factor"), Identifier1 = structure(c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 
2L), .Label = c("h", "j"), class = "factor"), X3.7 = c(NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 630404040L, NA, NA, 1900000L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
650404040L, NA, 7000000L, NA), X3.8 = c(NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 503456, NA, NA, 
1.27e+11, NA, NA, 51004444, NA, 1200000, 51004444, NA, 1200000, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 50004444, NA, 
NA, 1.7e+11, 123456, NA, NA, NA, 50004444, NA, 1200000, NA, 1200000
), X3.11 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 1.6e+10, NA, NA, 1.3005e+10, NA, NA, 3e+09, 2.8e+09, 4e+09, 
4.3e+09, 2.8e+09, 4e+09, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 3e+09, NA, NA, 2.8005e+10, 1.7e+10, NA, NA, NA, 3e+09, 
3.5e+09, 4e+09, 3.5e+09, 4e+09), X3.12 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 4.4434e+10, NA, NA, 4.4434e+10, 
NA, NA, 4070015600, 7585030303, 8006666, 3762897490, 9.3e+10, 
54545454, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3870015600, 
NA, NA, 4.3434e+10, 4.3434e+10, NA, NA, NA, 3762897490, 7675030303, 
7006666, 9.3e+11, 54545454), X2.1 = c(34343L, NA, NA, NA, NA, 
NA, 360000000L, 1000000000L, 13500000L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 76000L, 
NA, 13000000L, 13000000L, 24000L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), X2.2 = c(NA, NA, NA, NA, 
NA, NA, 520000000L, 270000000L, 178L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
325500L, 90909090L, 198000L, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA), X3.1 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(NA, 
150, 540000, 3.02e+08, NA, NA, 11111111, 2323232, 102, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2434340000, 
1.6e+10, 2.8e+10, NA, NA, NA, 21000, 500, 6.5e+10, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X5.1 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 656565L, 656565L, 
656565L, 2340808L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), X5.2 = c(NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 104L, 104L, 104L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X5.4 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 64343L, 64343L, 
64343L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier2", 
"Identifier1", "X3.7", "X3.8", "X3.9", "X3.11", "X3.12", "X2.1", 
"X2.2", "X3.1", "X3.5", "X5.1", "X5.2", "X5.4"), row.names = c(NA, 
-50L), class = "data.frame")

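The main problem is that the database is not very compact. Everything that sits on a separate row in the original files ends up on a separate row in the merged sheet, even when all the identifiers are the same. This makes the dataset very large once it is built with the actual data, and most cells are NA.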

##attempt 2

pathName <- "path"
allFiles <- list.files(pathName, full.names = T) 
allFiles <- lapply(allFiles, read.csv)

#df <- allFiles %>% purrr::reduce(dplyr::left_join, by = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2"))
df <- allFiles %>% purrr::reduce(dplyr::left_join, by = c("ReportDate", "RL", "RLI"))

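This did not work. I don't know exactly what went wrong, but most of the data is missing from df, and I can't make sense of the resulting structure: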

structure(list(ReportDate = c("30/10/2016", "30/10/2016", "30/10/2016", 
"30/10/2016", "30/10/2016", "30/10/2016", "30/10/2016", "30/10/2016", 
"30/10/2016"), RL = c("Service 1", "Service 1", "Service 1", 
"Service 1", "Service 1", "Service 2", "Service 2", "Service 2", 
"Service 2"), RLI = c("ab", "ab", "ab", "ab", "ab", "ab", "cd", 
"cd", "f"), Identifier2.x = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "xy", class = "factor"), X2.1.x = c(NA, 
NA, NA, 34343L, NA, NA, 360000000L, 1000000000L, 13500000L), 
    X2.2.x = c(NA, NA, NA, NA, NA, NA, 520000000L, 270000000L, 
    178L), X3.1.x = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5.x = c(540000, 
    3.02e+08, 150, NA, NA, NA, 11111111, 2323232, 102), Identifier2.y = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = "xy", class = "factor"), 
    X2.1.y = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), X2.2.y = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), X3.1.y = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5.y = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Identifier1.x = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
    ), class = "factor"), Identifier2.x.x = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("xy", 
    "xz"), class = "factor"), X3.7.x = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), X3.8.x = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA), X3.9.x = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
    ), X3.11.x = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), X3.12.x = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Identifier1.y = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
    ), class = "factor"), Identifier2.y.y = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("xy", 
    "xz", "yx"), class = "factor"), X3.7.y = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), X3.8.y = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA), X3.9.y = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
    ), X3.11.y = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), X3.12.y = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Identifier1 = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
    ), class = "factor"), Identifier2 = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("xz", 
    "yx"), class = "factor"), X5.1 = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), X5.2 = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), X5.4 = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_)), class = "data.frame", row.names = c(NA, 
-9L), .Names = c("ReportDate", "RL", "RLI", "Identifier2.x", 
"X2.1.x", "X2.2.x", "X3.1.x", "X3.5.x", "Identifier2.y", "X2.1.y", 
"X2.2.y", "X3.1.y", "X3.5.y", "Identifier1.x", "Identifier2.x.x", 
"X3.7.x", "X3.8.x", "X3.9.x", "X3.11.x", "X3.12.x", "Identifier1.y", 
"Identifier2.y.y", "X3.7.y", "X3.8.y", "X3.9.y", "X3.11.y", "X3.12.y", 
"Identifier1", "Identifier2", "X5.1", "X5.2", "X5.4"))

Here is my last attempt:

##attempt 3 (Table to be read in is an Excel spreadsheet with the variable names ReportDate and so on as the top row.)
Table <- read.xlsx("path")
Table1 <- as.data.table(Table)

pathName <- "path"
allFiles <- list.files(pathName, full.names = T) 

for(i in 1:length(allFiles)) {

  dt <- read.csv(allFiles[i])
  dt1 <- as.data.table(dt)

  #set keys
  setkey(Table1, "ReportDate", "RL", "RLI", "Identifier1", "Identifier2")
  setkey(dt1, "ReportDate", "RL", "RLI", "Identifier1", "Identifier2")
  NewTable <- merge(Table1, dt1, all=TRUE)
  return(NewTable)
  rm(dt1)
}

Attempt 3 returns the following error message:

Error in setkeyv(x, cols, verbose = verbose, physical = physical) : 
  some columns are not in the data.table: Identifier1 

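This is a problem in itself. Some of the data files I am trying to merge contain Identifier1, some contain Identifier2, and some contain both. I need to use them as keys so that data end up on one row only if the same identifiers are present in the files concerned and hold the same values for the observations that belong on that row. However, it appears that my function only allows merging on keys that are present in all files. For now, I re-ran the code without those keys: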

##attempt 4
Table <- read.xlsx("path")
Table1 <- as.data.table(Table)

pathName <- "path"
allFiles <- list.files(pathName, full.names = T) 

for(i in 1:length(allFiles)) {

  dt <- read.csv(allFiles[i])
  dt1 <- as.data.table(dt)

  #set keys
  setkey(Table1, "ReportDate", "RL", "RLI")
  setkey(dt1, "ReportDate", "RL", "RLI")
  NewTable <- merge(Table1, dt1, all=TRUE)
  return(NewTable)
  rm(dt1)
}

This in turn returned another error:

 Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'ReportDate' is a factor column being joined to i.'ReportDate' which is type 'logical'. Factor columns must join to factor or character columns. 

EDIT3: The structure of Table1 from attempts 3 and 4 is as follows:

dput(Table1)
structure(list(ReportDate = logical(0), RL = logical(0), RLI = logical(0), 
    Identifier1 = logical(0), Identifier2 = logical(0), `2.1` = logical(0), 
    `2.2000000000000002` = logical(0), `3.1` = logical(0), `3.5` = logical(0), 
    `3.7` = logical(0), `3.8` = logical(0), `3.9` = logical(0), 
    `3.11` = logical(0), `3.12` = logical(0), `5.0999999999999996` = logical(0), 
    `5.2` = logical(0), `5.4` = logical(0)), .Names = c("ReportDate", 
"RL", "RLI", "Identifier1", "Identifier2", "2.1", "2.2000000000000002", 
"3.1", "3.5", "3.7", "3.8", "3.9", "3.11", "3.12", "5.0999999999999996", 
"5.2", "5.4"), row.names = integer(0), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x00000000000b0788>, sorted = c("ReportDate", 
"RL", "RLI"))
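
Judging by this structure, every column of Table1 is logical(0), i.e. the template has no rows, which is consistent with the type mismatch reported in the second error (factor versus logical). Note also that the loops above assign NewTable afresh on each pass rather than accumulating a result. A minimal sketch of an alternative loop follows, using base R merge instead of data.table keys, which is a swapped-in technique and not the asker's method. It seeds the result with the first file instead of an empty template and merges each later file into it. The in-memory CSV strings are made-up stand-ins for the real files.

```r
# Hypothetical in-memory stand-ins for two CSV files
csv_texts <- c(
  "ReportDate,RL,RLI,X2.1\n30/10/2016,Service 1,ab,5\n",
  "ReportDate,RL,RLI,X3.7\n30/10/2016,Service 1,ab,9\n01/12/2016,Service 1,ab,4\n"
)
keys <- c("ReportDate", "RL", "RLI")

merged <- NULL
for (txt in csv_texts) {
  # stringsAsFactors = FALSE keeps the key columns as plain character
  dt <- read.csv(text = txt, stringsAsFactors = FALSE)
  # The first file seeds the result; later files are merged into it
  merged <- if (is.null(merged)) dt else merge(merged, dt, by = keys, all = TRUE)
}
```

This only compacts cleanly when each key combination occurs at most once per file; repeated combinations make merge return every pairing of the matching rows.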

1 Answer:

Answer 0: (score: 0)

OK, I think this is what you want. This solution uses dplyr and purrr.

First, load the sample data.

df1 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L,2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"), Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA, 34343L, NA, NA, 360000000L, 1000000000L, 13500000L), X2.2 = c(NA, NA, NA, NA, NA, NA, 520000000L, 270000000L, 178L), X3.1 = c(NA,  NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(540000, 3.02e+08, 150, NA, NA, NA, 11111111, 2323232, 102)), .Names = c("ReportDate", "RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, 
-9L))    
df2 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L,1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"),  Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,  1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA,  76000L, NA, NA, 13000000L, 13000000L, 24000L), X2.2 = c(NA,  NA, NA, NA, NA, NA, 90909090L, 325500L, 198000L), X3.1 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(1.6e+10, 2434340000,2.8e+10, NA, NA, NA, 500, 21000, 6.5e+10)), .Names = c("ReportDate","RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, -9L))
df3 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e"), class = "factor"), Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz"), class = "factor"),X3.7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7000000L, 650404040L), X3.8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, NA, NA, NA, NA, NA, 123456, 1.7e+11, NA, NA, 50004444, 50004444,1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA,NA, 1.7e+10, 2.8005e+10, NA, NA, 3e+09, 3e+09, 4e+09, 4e+09, 3.5e+09, 3.5e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.3434e+10, 4.3434e+10, NA, NA, 3870015600, 3762897490, 54545454, 7006666,9.3e+11, 7675030303)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", "X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L))    
df4 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e", "f"), class = "factor"), Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz", "yx" ), class = "factor"), X3.7 = c(NA, NA, NA, NA, NA, NA, NA,NA, NA, NA, NA, NA, NA, NA, 1900000L, 630404040L), X3.8 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, NA, NA, NA, NA, NA, 503456, 1.27e+11, NA, NA, 51004444, 51004444, 1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA, NA, 1.6e+10, 1.3005e+10, NA, NA, 3e+09, 4.3e+09, 4e+09, 4e+09, 2.8e+09, 2.8e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.4434e+10, 4.4434e+10, NA, NA, 4070015600, 3762897490, 54545454, 8006666, 9.3e+10, 7585030303)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9","X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L))    
df5 <- structure(list(ReportDate = structure(c(1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 1L), .Label = "cd", class = "factor"), Identifier1 = structure(c(2L, 1L, 2L), .Label = c("h", "j"), class = "factor"), Identifier2 = structure(c(1L, 2L, 2L), .Label = c("xz", "yx"), class = "factor"), X5.1 = c(656565L, 2340808L, NA), X5.2 = c(104L, NA, NA), X5.4 = c(64343L, NA, NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier1","Identifier2", "X5.1", "X5.2", "X5.4"), class = "data.frame", row.names = c(NA, -3L))

Then load the libraries and put all five data frames into a list.

library(dplyr)
library(purrr)

dfs <- list("file1" = df1, "file2" = df2, "file3" = df3, "file4" = df4, "file5" = df5)

Now make a vector of the variable names you ultimately want to join on.

shared_vars <- names(dfs$file5[1:5])
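
Taking the names from file5 by position works here because file5 happens to hold all five identifier columns as its first five columns. A minimal sketch of spelling the vector out and checking which data frames lack which of these columns; dfs_demo is a made-up stand-in for the list above, and only its column names matter.

```r
# Made-up stand-ins for two of the files (only the column names matter here)
dfs_demo <- list(
  file1 = data.frame(ReportDate = "30/10/2016", RL = "Service 1", RLI = "ab",
                     Identifier2 = "xy", X2.1 = 1L),
  file5 = data.frame(ReportDate = "30/10/2016", RL = "Service2", RLI = "cd",
                     Identifier1 = "j", Identifier2 = "xz", X5.1 = 2L)
)

# Joining variables written out explicitly instead of taken by position
key_vars <- c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2")

# For each data frame, list the joining variables it does not have
missing_keys <- lapply(dfs_demo, function(d) setdiff(key_vars, names(d)))
```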

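Because the five data frames do not all have the same columns, and some lack columns needed for joining (e.g. Identifier1), write a function that creates those missing columns and fills them in wherever they do not yet exist (the filling of missing columns is adapted from another answer, as is the help with column type conversion).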

# function to create missing columns of joining variables where they don't already exist in a dataframe
make_missing_cols <- function(varnames, df) {
    missing_vars <- setdiff(varnames, names(df))
    if (length(missing_vars) > 0) {
        # add each missing joining variable as an all-NA column
        new_df <- data.frame(df, setNames(as.list(rep(NA, length(missing_vars))), missing_vars))
        # convert any new columns to factor (this will also change other logical columns to factors)
        new_df[sapply(new_df, is.logical)] <- lapply(new_df[sapply(new_df, is.logical)], as.factor)
    } else {
        new_df <- df
    }
    # return the data frame with its columns in alphabetical order
    new_df[ , order(colnames(new_df))]
}
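
The same idea can be written more compactly. A minimal sketch, not the function above: add_missing_keys is a hypothetical helper that adds each absent joining variable as a character NA column and, unlike make_missing_cols, leaves the existing columns and their order untouched. The demo data frame is made up.

```r
add_missing_keys <- function(df, key_cols) {
  for (col in setdiff(key_cols, names(df))) {
    # rep() keeps this valid even for a data frame with zero rows
    df[[col]] <- rep(NA_character_, nrow(df))
  }
  df
}

demo <- data.frame(ReportDate = c("30/10/2016", "30/10/2016"),
                   Identifier2 = c("xy", "xy"),
                   X2.1 = c(NA, 34343L),
                   stringsAsFactors = FALSE)
filled <- add_missing_keys(demo, c("ReportDate", "Identifier1", "Identifier2"))
```

With character NA columns, the identifier columns of the other data frames may need converting from factor to character before joining so that the key types agree.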

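Now apply the make_missing_cols function to each of the five data frames in the list, creating a new list of five data frames, each of which now contains all of the joining columns.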

dfs_allcols <- 
    dfs %>% 
    map(~ make_missing_cols(varnames = shared_vars, df = .))

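Finally, join the five data frames into one. Not specifying a by argument to full_join makes dplyr join on all variables that share a name between the data frames being joined. arrange simply sorts outdf on the specified columns. distinct keeps only unique rows.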

outdf <- 
    dfs_allcols %>% 
    reduce(full_join) %>% 
    arrange(ReportDate, RL, RLI, Identifier1, Identifier2) %>%
    distinct

A snapshot of outdf:

# A tibble: 43 x 17
   Identifier1 Identifier2 ReportDate        RL   RLI     X2.1     X2.2   X3.1        X3.5 X3.11 X3.12
         <chr>       <chr>      <chr>     <chr> <chr>    <int>    <int> <fctr>       <dbl> <dbl> <dbl>
 1        <NA>          xy 01/12/2016 Service 1    ab       NA       NA     NA 16000000000    NA    NA
 2        <NA>          xy 01/12/2016 Service 1    ab       NA       NA     NA  2434340000    NA    NA
 3        <NA>          xy 01/12/2016 Service 1    ab       NA       NA     NA 28000000000    NA    NA
 4        <NA>          xy 01/12/2016 Service 1    ab    76000       NA     NA          NA    NA    NA
 5        <NA>          xy 01/12/2016 Service 1    ab       NA       NA     NA          NA    NA    NA
 6        <NA>          xy 01/12/2016 Service 2    ab       NA       NA     NA          NA    NA    NA
 7        <NA>          xy 01/12/2016 Service 2    cd 13000000 90909090     NA         500    NA    NA
 8        <NA>          xy 01/12/2016 Service 2    cd 13000000   325500     NA       21000    NA    NA
 9        <NA>          xy 01/12/2016 Service 2     f    24000   198000     NA 65000000000    NA    NA
10           h          xz 01/12/2016  Service2    ab       NA       NA     NA          NA    NA    NA
# ... with 33 more rows, and 6 more variables: X3.7 <int>, X3.8 <lgl>, X3.9 <dbl>, X5.1 <int>, X5.2 <int>,
#   X5.4 <int>
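
The joined result still has one row for every distinct row of the input files. If rows that share the same identifiers should additionally be folded into a single row, one option is to take the first non-missing value of each column within each identifier group. The following is a minimal base R sketch under a strong assumption: within a group, each column holds at most one distinct non-missing value. Otherwise values are silently dropped, which would happen with the repeated identifier combinations in the sample data. collapse_by_keys and first_non_na are hypothetical helpers, the merged data frame is made up, and the key values are assumed not to contain tab characters.

```r
# First non-missing value of a vector (NA of the same type if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

collapse_by_keys <- function(df, key_cols) {
  # One grouping label per row; an NA key becomes the string "NA"
  group_id <- do.call(paste, c(df[key_cols], sep = "\t"))
  pieces <- lapply(split(df, group_id), function(piece) {
    as.data.frame(lapply(piece, first_non_na), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, unname(pieces))
  rownames(out) <- NULL
  out
}

merged <- data.frame(
  ReportDate = c("30/10/2016", "30/10/2016", "01/12/2016"),
  RL = c("Service2", "Service2", "Service2"),
  X3.9 = c(503456, NA, 123456),
  X5.1 = c(NA, 656565, NA),
  stringsAsFactors = FALSE
)
compact <- collapse_by_keys(merged, c("ReportDate", "RL"))
```

Here the two rows for 30/10/2016 are folded into one that carries both X3.9 and X5.1, while the 01/12/2016 row is left as it was.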

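Note that after the join you may need to modify outdf somewhat to get the variables into the correct column types, especially since the make_missing_cols function converts any logical column to factor (for joining purposes).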
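
A minimal sketch of such a clean-up, assuming the dates are in day/month/year format as in the sample data; the demo data frame is made up. Converting ReportDate to a Date also matters for sorting, because as text 01/12/2016 sorts before 30/10/2016.

```r
demo <- data.frame(
  ReportDate = factor(c("01/12/2016", "30/10/2016")),
  Identifier1 = factor(c(NA, "h")),
  X3.5 = c(500, 102)
)

# Turn every factor column back into plain character
is_fac <- vapply(demo, is.factor, logical(1))
demo[is_fac] <- lapply(demo[is_fac], as.character)

# Parse the report date (assumed day/month/year) and sort chronologically
demo$ReportDate <- as.Date(demo$ReportDate, format = "%d/%m/%Y")
demo <- demo[order(demo$ReportDate), ]
```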