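Apologies in advance if this has been asked and answered elsewhere, but I have looked through many threads and nothing has worked so far.
I am working on merging a large number of data files into one database. Every data file has the same first four (sometimes five) column names, which are words/characters, but after that the column names are numbers and differ between files (they are the same within a given set of files, but differ between sets). As an example, suppose File 1 has columns a, b, c, d, 1.1, 1.2, 2.3, File 2 has columns a, b, c, d, 1.3, 1.4, 2.1, and File 3 has columns a, b, c, e, 3.2, 5.1.
Each file then contains a different number of observations (the first column is always the report date). Some observations are numeric, others are character. I would like to read in all files at once and combine them into one data frame in which (1) columns shared with other files are merged into one, (2) columns that differ are added automatically, and (3) observations whose values in the first four/five columns (i.e. the report date and similar specifications) are all identical are entered on the same row. For example, if Files 1 and 2 are identical on columns a, b, c and d, but File 1 has observations in columns 1.1, 1.2, 2.3 and is missing the rest, while File 2 has observations in columns a, b, c, d, 1.3, 1.4, 2.1 and none in the others, I want those observations to simply be added to the same row. (The best I have managed so far is a separate row for every row in the original files, which leaves a final result that consists mostly of NAs/empty cells and is not very compact or usable.)
I have a large number of files, each of a different length, and I would like to read them in and merge them in one go using a loop. What I have managed so far is the following: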
# packages
library(data.table)
library(plyr)
library(reshape)
library(dplyr)
#make a list of files in a folder and label it "filenames"
filenames <- list.files("path", full.names = T)
#read each element in "filenames" into R and label the resulting list "csvs"
csvs <- lapply(filenames, read.csv)
#merge all elements of "csvs" into one data table
merged.sheet = Reduce(function(...) merge(..., all=T), csvs)
#export table as csv
write.csv(merged.sheet, "path")
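The result contains all the data I want, and each column is added only once, as I hoped (although the column order is odd and I don't know how to sort it the way I want, and R for some reason adds an X in front of each column name). However, it is not compact at all: it simply puts one row below another, even where rows could be combined because the identifying values (date, category, etc.) are the same and the observations are in different columns.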
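On the X prefix and the column order: by default read.csv checks column names with make.names, which puts an X in front of names that start with a digit, and passing check.names = FALSE keeps the names as written. Columns can then be put in any order by indexing with a vector of names. A minimal sketch, where the temporary file and its contents are made up for illustration and are not part of the real data:

```r
# Write a tiny made-up CSV whose last two column names start with a digit
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,1.1,2.3", "x,y,10,20"), tmp)

# Default behaviour: names are made syntactically valid, so an X is added
with_prefix <- read.csv(tmp)

# check.names = FALSE keeps the names exactly as they appear in the file
as_written <- read.csv(tmp, check.names = FALSE)

# Reorder columns by indexing with the names in the order you want
wanted <- c("a", "b", "2.3", "1.1")
reordered <- as_written[, wanted]

unlink(tmp)
```

Columns whose names start with a digit then have to be referenced with backticks or double brackets, e.g. as_written[["1.1"]].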
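I have played around a lot and searched extensively, but nothing has worked so far. For example, I tried setkey before merging, but that gave an error because what I had read in was a list, not a data frame. I tried various melt functions, but they either returned an error (when I specified ID variables, R told me it could not find them in the data, even though I had copied them exactly) or didn't identify my numbered columns as IDs and omitted quite a lot of the data (when I did not specify ID variables). I also tried passing arguments to the merge function, e.g. by = "a", but that did not give me the result I wanted. I tried by = "a", by.x = "b" and by.y = "c", which returned an error message (something about an argument having the wrong length). Passing several arguments to by also returned an error (because only one unique column name is allowed).
I am new to R and cannot think of anything else. Any help would be much appreciated!
EDIT: I have created some sample data to illustrate what my data set looks like. The sample data consist of 5 files.
EDIT2: Here is the structure of the 5 sample files: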
dput(File1)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"
), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"),
Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA,
34343L, NA, NA, 360000000L, 1000000000L, 13500000L), X2.2 = c(NA,
NA, NA, NA, NA, NA, 520000000L, 270000000L, 178L), X3.1 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(540000, 3.02e+08,
150, NA, NA, NA, 11111111, 2323232, 102)), .Names = c("ReportDate",
"RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA,
-9L))
> dput(File2)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"
), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"),
Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA,
76000L, NA, NA, 13000000L, 13000000L, 24000L), X2.2 = c(NA,
NA, NA, NA, NA, NA, 90909090L, 325500L, 198000L), X3.1 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(1.6e+10, 2434340000,
2.8e+10, NA, NA, NA, 500, 21000, 6.5e+10)), .Names = c("ReportDate",
"RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA,
-9L))
> dput(File3)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "01/12/2016", class = "factor"),
RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"),
RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e"), class = "factor"),
Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),
Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz"), class = "factor"),
X3.7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 7000000L, 650404040L), X3.8 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA,
NA, NA, NA, NA, NA, 123456, 1.7e+11, NA, NA, 50004444, 50004444,
1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA,
NA, 1.7e+10, 2.8005e+10, NA, NA, 3e+09, 3e+09, 4e+09, 4e+09,
3.5e+09, 3.5e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.3434e+10,
4.3434e+10, NA, NA, 3870015600, 3762897490, 54545454, 7006666,
9.3e+11, 7675030303)), .Names = c("ReportDate", "RL", "RLI",
"Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", "X3.11",
"X3.12"), class = "data.frame", row.names = c(NA, -16L))
> dput(File4)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"),
RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"),
RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L,
4L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e", "f"), class = "factor"),
Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),
Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz", "yx"
), class = "factor"), X3.7 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, 1900000L, 630404040L), X3.8 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), X3.9 = c(NA, NA, NA, NA, NA, NA, 503456, 1.27e+11, NA,
NA, 51004444, 51004444, 1200000, 1200000, NA, NA), X3.11 = c(NA,
NA, NA, NA, NA, NA, 1.6e+10, 1.3005e+10, NA, NA, 3e+09, 4.3e+09,
4e+09, 4e+09, 2.8e+09, 2.8e+09), X3.12 = c(NA, NA, NA, NA,
NA, NA, 4.4434e+10, 4.4434e+10, NA, NA, 4070015600, 3762897490,
54545454, 8006666, 9.3e+10, 7585030303)), .Names = c("ReportDate",
"RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9",
"X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L
))
> dput(File5)
structure(list(ReportDate = structure(c(1L, 1L, 1L), .Label = "30/10/2016", class = "factor"),
RL = structure(c(1L, 1L, 1L), .Label = "Service2", class = "factor"),
RLI = structure(c(1L, 1L, 1L), .Label = "cd", class = "factor"),
Identifier1 = structure(c(2L, 1L, 2L), .Label = c("h", "j"
), class = "factor"), Identifier2 = structure(c(1L, 2L, 2L
), .Label = c("xz", "yx"), class = "factor"), X5.1 = c(656565L,
2340808L, NA), X5.2 = c(104L, NA, NA), X5.4 = c(64343L, NA,
NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier1",
"Identifier2", "X5.1", "X5.2", "X5.4"), class = "data.frame", row.names = c(NA,
-3L))
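The actual data set looks more or less like this, except that the variables have different names and observations, there are hundreds of files, and there are many more numbered variables (the X in front of the variables is added by R; it is not in the original csv files). (Also, in the sample data I only included numbers and empty cells, but some observations in the original data set are character.)
I would like to merge these files so that the result is as compact as possible. For example, observation 1 in File 5 should be on the same row as observation 1 in File 4, because the date, RL, RLI and Identifiers 1 and 2 are the same for both and the observations are in different columns. But if the date or one of the other identifiers differs, they should be on separate rows.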
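To make the desired behaviour concrete, here is a small sketch on made-up data (f4, f5 and their values are hypothetical, not the sample files). With base merge and all = TRUE, rows whose key columns all agree end up on one row, while any difference in a key, including "Service 2" versus "Service2", keeps them apart:

```r
# Two made-up tables with the same key columns and different value columns
f4 <- data.frame(ReportDate  = c("30/10/2016", "30/10/2016"),
                 RL          = c("Service2", "Service2"),
                 Identifier1 = c("h", "j"),
                 X3.9        = c(10, 20),
                 stringsAsFactors = FALSE)
f5 <- data.frame(ReportDate  = c("30/10/2016", "30/10/2016"),
                 RL          = c("Service2", "Service 2"),  # note the extra space
                 Identifier1 = c("h", "j"),
                 X5.1        = c(1, 2),
                 stringsAsFactors = FALSE)

keys <- c("ReportDate", "RL", "Identifier1")

# all = TRUE keeps unmatched rows from both sides (a full outer join)
merged <- merge(f4, f5, by = keys, all = TRUE)

# The "h" rows agree on every key and share one row; the two "j" rows
# differ in RL, so each stays on its own row with NA in the other column.
print(merged)
```

If a key combination occurs more than once within a file, merge returns one row per matching pair, so the result can only be as compact as the keys are unique.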
My three main attempts so far are as follows:
# packages
library(data.table)
library(openxlsx)
library(plyr)
library(reshape)
library(dplyr)
library(tidyverse)
library(purrr)
##attempt 1
#make a list of files in a folder and label it "allFiles"
pathName <- "path"
allFiles <- list.files(pathName, full.names = T)
allFiles <- lapply(allFiles, read.csv)
#merge all elements of "allFiles" into one datatable
merged.sheet = Reduce(function(...) merge(..., all=T), allFiles)
This is the best approach so far. The structure of merged.sheet is
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("30/10/2016",
"01/12/2016"), class = "factor"), RL = structure(c(1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Service 1",
"Service 2", "Service2"), class = "factor"), RLI = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L,
4L), .Label = c("ab", "cd", "f", "e"), class = "factor"), Identifier2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L), .Label = c("xy", "xz", "yx"), class = "factor"), Identifier1 = structure(c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L), .Label = c("h", "j"), class = "factor"), X3.7 = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 630404040L, NA, NA, 1900000L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
650404040L, NA, 7000000L, NA), X3.8 = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 503456, NA, NA,
1.27e+11, NA, NA, 51004444, NA, 1200000, 51004444, NA, 1200000,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 50004444, NA,
NA, 1.7e+11, 123456, NA, NA, NA, 50004444, NA, 1200000, NA, 1200000
), X3.11 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 1.6e+10, NA, NA, 1.3005e+10, NA, NA, 3e+09, 2.8e+09, 4e+09,
4.3e+09, 2.8e+09, 4e+09, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 3e+09, NA, NA, 2.8005e+10, 1.7e+10, NA, NA, NA, 3e+09,
3.5e+09, 4e+09, 3.5e+09, 4e+09), X3.12 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 4.4434e+10, NA, NA, 4.4434e+10,
NA, NA, 4070015600, 7585030303, 8006666, 3762897490, 9.3e+10,
54545454, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3870015600,
NA, NA, 4.3434e+10, 4.3434e+10, NA, NA, NA, 3762897490, 7675030303,
7006666, 9.3e+11, 54545454), X2.1 = c(34343L, NA, NA, NA, NA,
NA, 360000000L, 1000000000L, 13500000L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 76000L,
NA, 13000000L, 13000000L, 24000L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), X2.2 = c(NA, NA, NA, NA,
NA, NA, 520000000L, 270000000L, 178L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
325500L, 90909090L, 198000L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), X3.1 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(NA,
150, 540000, 3.02e+08, NA, NA, 11111111, 2323232, 102, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2434340000,
1.6e+10, 2.8e+10, NA, NA, NA, 21000, 500, 6.5e+10, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X5.1 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 656565L, 656565L,
656565L, 2340808L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), X5.2 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 104L, 104L, 104L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X5.4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 64343L, 64343L,
64343L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier2",
"Identifier1", "X3.7", "X3.8", "X3.9", "X3.11", "X3.12", "X2.1",
"X2.2", "X3.1", "X3.5", "X5.1", "X5.2", "X5.4"), row.names = c(NA,
-50L), class = "data.frame")
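The main problem is that the database is not very compact. Everything that was on a separate row in the original files is on a separate row in the merged sheet, even when all identifiers are the same. This makes the data set very large once it is done with the real data, and most cells are NA.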
##attempt 2
pathName <- "path"
allFiles <- list.files(pathName, full.names = T)
allFiles <- lapply(allFiles, read.csv)
#df <- allFiles %>% purrr::reduce(dplyr::left_join, by = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2"))
df <- allFiles %>% purrr::reduce(dplyr::left_join, by = c("ReportDate", "RL", "RLI"))
This did not work. I don't know exactly what went wrong, but most of the data is missing from df, and I don't understand the resulting structure:
structure(list(ReportDate = c("30/10/2016", "30/10/2016", "30/10/2016",
"30/10/2016", "30/10/2016", "30/10/2016", "30/10/2016", "30/10/2016",
"30/10/2016"), RL = c("Service 1", "Service 1", "Service 1",
"Service 1", "Service 1", "Service 2", "Service 2", "Service 2",
"Service 2"), RLI = c("ab", "ab", "ab", "ab", "ab", "ab", "cd",
"cd", "f"), Identifier2.x = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "xy", class = "factor"), X2.1.x = c(NA,
NA, NA, 34343L, NA, NA, 360000000L, 1000000000L, 13500000L),
X2.2.x = c(NA, NA, NA, NA, NA, NA, 520000000L, 270000000L,
178L), X3.1.x = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5.x = c(540000,
3.02e+08, 150, NA, NA, NA, 11111111, 2323232, 102), Identifier2.y = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = "xy", class = "factor"),
X2.1.y = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), X2.2.y = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), X3.1.y = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5.y = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Identifier1.x = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
), class = "factor"), Identifier2.x.x = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("xy",
"xz"), class = "factor"), X3.7.x = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X3.8.x = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA), X3.9.x = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), X3.11.x = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), X3.12.x = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Identifier1.y = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
), class = "factor"), Identifier2.y.y = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("xy",
"xz", "yx"), class = "factor"), X3.7.y = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X3.8.y = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA), X3.9.y = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), X3.11.y = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), X3.12.y = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Identifier1 = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
), class = "factor"), Identifier2 = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("xz",
"yx"), class = "factor"), X5.1 = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X5.2 = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X5.4 = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), class = "data.frame", row.names = c(NA,
-9L), .Names = c("ReportDate", "RL", "RLI", "Identifier2.x",
"X2.1.x", "X2.2.x", "X3.1.x", "X3.5.x", "Identifier2.y", "X2.1.y",
"X2.2.y", "X3.1.y", "X3.5.y", "Identifier1.x", "Identifier2.x.x",
"X3.7.x", "X3.8.x", "X3.9.x", "X3.11.x", "X3.12.x", "Identifier1.y",
"Identifier2.y.y", "X3.7.y", "X3.8.y", "X3.9.y", "X3.11.y", "X3.12.y",
"Identifier1", "Identifier2", "X5.1", "X5.2", "X5.4"))
This was my last attempt:
##attempt 3 (Table to be read in is an Excel spreadsheet with the variable names ReportDate and so on as the top row.)
Table <- read.xlsx("path")
Table1 <- as.data.table(Table)
pathName <- "path"
allFiles <- list.files(pathName, full.names = T)
for(i in 1:length(allFiles)) {
dt <- read.csv(allFiles[i])
dt1 <- as.data.table(dt)
#set keys
setkey(Table1, "ReportDate", "RL", "RLI", "Identifier1", "Identifier2")
setkey(dt1, "ReportDate", "RL", "RLI", "Identifier1", "Identifier2")
NewTable <- merge(Table1, dt1, all=TRUE)
return(NewTable)
rm(dt1)
}
Attempt 3 returns the following error message:
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
some columns are not in the data.table: Identifier1
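This is a problem in itself. Some of the data files I am trying to merge contain Identifier1, some contain Identifier2, and some contain both. I need to use them as keys so that data are merged onto one row only if the same identifiers are present in the files concerned and hold the same values for the observations that should share a row. However, it appears that my function only allows merging on keys that are present in all files. So for now I re-ran the code without these keys: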
##attempt 4
Table <- read.xlsx("path")
Table1 <- as.data.table(Table)
pathName <- "path"
allFiles <- list.files(pathName, full.names = T)
for(i in 1:length(allFiles)) {
dt <- read.csv(allFiles[i])
dt1 <- as.data.table(dt)
#set keys
setkey(Table1, "ReportDate", "RL", "RLI")
setkey(dt1, "ReportDate", "RL", "RLI")
NewTable <- merge(Table1, dt1, all=TRUE)
return(NewTable)
rm(dt1)
}
This returned yet another error:
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
x.'ReportDate' is a factor column being joined to i.'ReportDate' which is type 'logical'. Factor columns must join to factor or character columns.
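Both errors seem to come from the setup rather than from merge itself: the key columns have to exist in both tables, and here one ReportDate column is a factor while the other is logical. A rough sketch of one way around this, assuming it is acceptable to start from the first file instead of an empty template, to read text as character (stringsAsFactors = FALSE), and to fill absent key columns with NA. The list files and the helper add_missing_keys are made-up stand-ins, not part of the original code. Note also that return() only works inside a function, so the loop below simply reassigns result:

```r
keys <- c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2")

# Made-up stand-ins for what read.csv(path, stringsAsFactors = FALSE) returns
files <- list(
  data.frame(ReportDate = "30/10/2016", RL = "Service2", RLI = "cd",
             Identifier2 = "xz", X2.1 = 1, stringsAsFactors = FALSE),
  data.frame(ReportDate = "30/10/2016", RL = "Service2", RLI = "cd",
             Identifier1 = "j", Identifier2 = "xz", X5.1 = 2,
             stringsAsFactors = FALSE)
)

# Hypothetical helper: add any absent key column as character NA
add_missing_keys <- function(df, keys) {
  for (k in setdiff(keys, names(df))) df[[k]] <- NA_character_
  df
}

result <- NULL
for (df in files) {
  df <- add_missing_keys(df, keys)
  # With no `by`, merge joins on every column name the two tables share
  result <- if (is.null(result)) df else merge(result, df, all = TRUE)
}

print(result)
```

Starting from the first file means the accumulated table always has real column types, so there is no logical placeholder column to clash with a character or factor one.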
EDIT3: The structure of Table1 from attempts 3 and 4 is as follows:
dput(Table1)
structure(list(ReportDate = logical(0), RL = logical(0), RLI = logical(0),
Identifier1 = logical(0), Identifier2 = logical(0), `2.1` = logical(0),
`2.2000000000000002` = logical(0), `3.1` = logical(0), `3.5` = logical(0),
`3.7` = logical(0), `3.8` = logical(0), `3.9` = logical(0),
`3.11` = logical(0), `3.12` = logical(0), `5.0999999999999996` = logical(0),
`5.2` = logical(0), `5.4` = logical(0)), .Names = c("ReportDate",
"RL", "RLI", "Identifier1", "Identifier2", "2.1", "2.2000000000000002",
"3.1", "3.5", "3.7", "3.8", "3.9", "3.11", "3.12", "5.0999999999999996",
"5.2", "5.4"), row.names = integer(0), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x00000000000b0788>, sorted = c("ReportDate",
"RL", "RLI"))
Answer 0 (score: 0)
OK, I think this is what you want. This solution uses dplyr and purrr.
First, load the sample data.
df1 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L,2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"), Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA, 34343L, NA, NA, 360000000L, 1000000000L, 13500000L), X2.2 = c(NA, NA, NA, NA, NA, NA, 520000000L, 270000000L, 178L), X3.1 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(540000, 3.02e+08, 150, NA, NA, NA, 11111111, 2323232, 102)), .Names = c("ReportDate", "RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA,
-9L))
df2 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L,1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"), Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA, 76000L, NA, NA, 13000000L, 13000000L, 24000L), X2.2 = c(NA, NA, NA, NA, NA, NA, 90909090L, 325500L, 198000L), X3.1 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(1.6e+10, 2434340000,2.8e+10, NA, NA, NA, 500, 21000, 6.5e+10)), .Names = c("ReportDate","RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, -9L))
df3 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e"), class = "factor"), Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz"), class = "factor"),X3.7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7000000L, 650404040L), X3.8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, NA, NA, NA, NA, NA, 123456, 1.7e+11, NA, NA, 50004444, 50004444,1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA,NA, 1.7e+10, 2.8005e+10, NA, NA, 3e+09, 3e+09, 4e+09, 4e+09, 3.5e+09, 3.5e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.3434e+10, 4.3434e+10, NA, NA, 3870015600, 3762897490, 54545454, 7006666,9.3e+11, 7675030303)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", "X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L))
df4 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e", "f"), class = "factor"), Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz", "yx" ), class = "factor"), X3.7 = c(NA, NA, NA, NA, NA, NA, NA,NA, NA, NA, NA, NA, NA, NA, 1900000L, 630404040L), X3.8 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, NA, NA, NA, NA, NA, 503456, 1.27e+11, NA, NA, 51004444, 51004444, 1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA, NA, 1.6e+10, 1.3005e+10, NA, NA, 3e+09, 4.3e+09, 4e+09, 4e+09, 2.8e+09, 2.8e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.4434e+10, 4.4434e+10, NA, NA, 4070015600, 3762897490, 54545454, 8006666, 9.3e+10, 7585030303)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9","X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L))
df5 <- structure(list(ReportDate = structure(c(1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 1L), .Label = "cd", class = "factor"), Identifier1 = structure(c(2L, 1L, 2L), .Label = c("h", "j"), class = "factor"), Identifier2 = structure(c(1L, 2L, 2L), .Label = c("xz", "yx"), class = "factor"), X5.1 = c(656565L, 2340808L, NA), X5.2 = c(104L, NA, NA), X5.4 = c(64343L, NA, NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier1","Identifier2", "X5.1", "X5.2", "X5.4"), class = "data.frame", row.names = c(NA, -3L))
Then load the libraries and put all five data frames in a list.
library(dplyr)
library(purrr)
dfs <- list("file1" = df1, "file2" = df2, "file3" = df3, "file4" = df4, "file5" = df5)
Now make a vector of the variable names you ultimately want to join on.
shared_vars <- names(dfs$file5[1:5])
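Because shared_vars is taken by position from file5, it may be worth checking which of these columns each data frame actually lacks before joining. A small sketch on made-up frames (dfs_toy is a hypothetical stand-in for dfs; only the column names matter here, and the names are written out explicitly rather than taken by position):

```r
# Toy stand-ins for two of the data frames in the list
dfs_toy <- list(
  file1 = data.frame(ReportDate = "30/10/2016", RL = "Service 1", RLI = "ab",
                     Identifier2 = "xy", X2.1 = 1),
  file5 = data.frame(ReportDate = "30/10/2016", RL = "Service2", RLI = "cd",
                     Identifier1 = "j", Identifier2 = "xz", X5.1 = 2)
)

# The joining variables, spelled out by name
shared_vars <- c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2")

# For each data frame, list the joining variables it does not have
missing_keys <- lapply(dfs_toy, function(d) setdiff(shared_vars, names(d)))
print(missing_keys)
```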
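Because the five data frames do not all have the same columns, and some lack columns that are needed for joining, e.g. Identifier1, write a function that creates these missing columns and fills them in where they don't already exist (the code for filling in missing columns and for converting column types is adapted from other answers).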
# function to create missing columns of joining variables where they don't already exist in a dataframe
make_missing_cols <- function(varnames, df) {
  if (sum(!varnames %in% names(df)) != 0) {
    # add one all-NA column for each joining variable that df lacks
    new_df <- data.frame(df, setNames(as.list(rep(NA, sum(!varnames %in% names(df)))), setdiff(varnames, names(df))))
    # convert any new columns to factor (this will also change other logical columns to factors)
    new_df[sapply(new_df, is.logical)] <- lapply(new_df[sapply(new_df, is.logical)], as.factor)
    # return the columns in alphabetical order
    new_df[ , order(colnames(new_df))]
  } else {
    # nothing is missing: just reorder the columns alphabetically
    df[ , order(colnames(df))]
  }
}
Now apply the make_missing_cols function to each of the five dfs in the list, creating a new list of five dfs that each contain all five joining columns.
dfs_allcols <-
dfs %>%
map(~ make_missing_cols(varnames = shared_vars, df = .))
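Finally, join the five dfs into one df. Not specifying a by argument to full_join makes dplyr join on all variables whose names the two data frames being joined have in common (reduce applies the join pairwise down the list). arrange just sorts outdf on the specified columns. distinct keeps only unique rows.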
outdf <-
dfs_allcols %>%
reduce(full_join) %>%
arrange(ReportDate, RL, RLI, Identifier1, Identifier2) %>%
distinct
A snapshot of outdf:
# A tibble: 43 x 17
Identifier1 Identifier2 ReportDate RL RLI X2.1 X2.2 X3.1 X3.5 X3.11 X3.12
<chr> <chr> <chr> <chr> <chr> <int> <int> <fctr> <dbl> <dbl> <dbl>
1 <NA> xy 01/12/2016 Service 1 ab NA NA NA 16000000000 NA NA
2 <NA> xy 01/12/2016 Service 1 ab NA NA NA 2434340000 NA NA
3 <NA> xy 01/12/2016 Service 1 ab NA NA NA 28000000000 NA NA
4 <NA> xy 01/12/2016 Service 1 ab 76000 NA NA NA NA NA
5 <NA> xy 01/12/2016 Service 1 ab NA NA NA NA NA NA
6 <NA> xy 01/12/2016 Service 2 ab NA NA NA NA NA NA
7 <NA> xy 01/12/2016 Service 2 cd 13000000 90909090 NA 500 NA NA
8 <NA> xy 01/12/2016 Service 2 cd 13000000 325500 NA 21000 NA NA
9 <NA> xy 01/12/2016 Service 2 f 24000 198000 NA 65000000000 NA NA
10 h xz 01/12/2016 Service2 ab NA NA NA NA NA NA
# ... with 33 more rows, and 6 more variables: X3.7 <int>, X3.8 <lgl>, X3.9 <dbl>, X5.1 <int>, X5.2 <int>,
# X5.4 <int>
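As a side note on why full_join is used here rather than left_join: a left join keeps only the rows of the left-hand table, and any shared column that is not listed in by is duplicated with .x and .y suffixes. A small sketch on two made-up one-row frames (a and b are hypothetical and only loosely modelled on the sample files):

```r
library(dplyr)

# Two made-up tables with identical columns but different report dates
a <- data.frame(ReportDate = "30/10/2016", RL = "Service 1",
                Identifier2 = "xy", X2.1 = 5, stringsAsFactors = FALSE)
b <- data.frame(ReportDate = "01/12/2016", RL = "Service 1",
                Identifier2 = "xy", X2.1 = 7, stringsAsFactors = FALSE)

# Left join on a subset of the shared columns: only a's row survives, and
# the shared columns left out of `by` come back as .x/.y pairs
left <- left_join(a, b, by = c("ReportDate", "RL"))

# Full join with no `by`: joins on every shared column and keeps the
# unmatched rows from both sides
full <- full_join(a, b)

names(left)
nrow(full)
```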
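Note that after the join step you may need to make some modifications to outdf to get the variables into the correct column types, especially because the make_missing_cols function converts any logical columns to factor class (for joining purposes).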
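A minimal sketch of that clean-up in base R, on a made-up stand-in for outdf (outdf_toy and its values are hypothetical). It assumes the factor columns should end up numeric and that ReportDate is in day/month/year format; columns that really hold text would need as.character instead:

```r
# Made-up stand-in: a text date, an all-NA factor column and a numeric column
outdf_toy <- data.frame(
  ReportDate = c("01/12/2016", "30/10/2016"),
  X3.1       = factor(c(NA, NA)),
  X3.5       = c(500, 102),
  stringsAsFactors = FALSE
)

# Parse the dates so that they sort chronologically rather than as text
outdf_toy$ReportDate <- as.Date(outdf_toy$ReportDate, format = "%d/%m/%Y")

# Convert every factor column back to numeric via character
is_fct <- vapply(outdf_toy, is.factor, logical(1))
outdf_toy[is_fct] <- lapply(outdf_toy[is_fct],
                            function(x) as.numeric(as.character(x)))

# Sort by the parsed date
outdf_toy <- outdf_toy[order(outdf_toy$ReportDate), ]
str(outdf_toy)
```

Sorting on the text form puts 01/12/2016 ahead of 30/10/2016, as in the snapshot, so parsing the dates first matters if chronological order is wanted.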