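I am going to share with you a simplified version of my large dataset. This simplified version fully respects the structure of my original dataset, but it contains fewer list elements, data frames, variables and observations than the original.
Following the top-voted answer to the question "How to make a great R reproducible example?", I am sharing my dataset as the output of dput(query1), so that you get something immediately usable in R by copy/pasting the following code block into the R console: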
query1 <- structure(list(plu = structure(list(year = structure(list(id = 1:3,
station = 100:102, pluMean = c(0.509068994778059, 1.92866478959912,
1.09517453602154), pluMax = c(0.0146962179957886, 0.802984389130343,
2.48170762478472)), .Names = c("id", "station", "pluMean",
"pluMax"), row.names = c(NA, -3L), class = "data.frame"), month = structure(list(
id = 1:3, station = 100:102, pluMean = c(0.66493845927034,
-1.3559338786041, 0.195600637750077), pluMax = c(0.503424623872161,
0.234402501255681, -0.440264545434053)), .Names = c("id",
"station", "pluMean", "pluMax"), row.names = c(NA, -3L), class = "data.frame"),
week = structure(list(id = 1:3, station = 100:102, pluMean = c(-0.608295829330578,
-1.10256919591373, 1.74984007126193), pluMax = c(0.969668266601551,
0.924426323739882, 3.47460867665884)), .Names = c("id", "station",
"pluMean", "pluMax"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("year",
"month", "week")), tsa = structure(list(year = structure(list(
id = 1:3, station = 100:102, tsaMean = c(-1.49060721773042,
-0.684735418997484, 0.0586655881113975), tsaMax = c(0.25739838787582,
0.957634817758648, 1.37198023881125)), .Names = c("id", "station",
"tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame"),
month = structure(list(id = 1:3, station = 100:102, tsaMean = c(-0.684668662999479,
-1.28087846387974, -0.600175481941456), tsaMax = c(0.962916941685075,
0.530773351897188, -0.217143593955998)), .Names = c("id",
"station", "tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame"),
week = structure(list(id = 1:3, station = 100:102, tsaMean = c(0.376481732842365,
0.370435880636005, -0.105354927593471), tsaMax = c(1.93833635147645,
0.81176751708868, 0.744932493064975)), .Names = c("id", "station",
"tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("year",
"month", "week"))), .Names = c("plu", "tsa"))
Once this is done, running str(query1) gives you the structure of my example dataset:
> str(query1)
List of 2
$ plu:List of 3
..$ year :'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ pluMean: num [1:3] 0.509 1.929 1.095
.. ..$ pluMax : num [1:3] 0.0147 0.803 2.4817
..$ month:'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ pluMean: num [1:3] 0.665 -1.356 0.196
.. ..$ pluMax : num [1:3] 0.503 0.234 -0.44
..$ week :'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ pluMean: num [1:3] -0.608 -1.103 1.75
.. ..$ pluMax : num [1:3] 0.97 0.924 3.475
$ tsa:List of 3
..$ year :'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ tsaMean: num [1:3] -1.4906 -0.6847 0.0587
.. ..$ tsaMax : num [1:3] 0.257 0.958 1.372
..$ month:'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ tsaMean: num [1:3] -0.685 -1.281 -0.6
.. ..$ tsaMax : num [1:3] 0.963 0.531 -0.217
..$ week :'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ tsaMean: num [1:3] 0.376 0.37 -0.105
.. ..$ tsaMax : num [1:3] 1.938 0.812 0.745
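So how should this be read? I have a big list (query1) made of 2 parameter elements (plu & tsa). Each of these 2 parameter elements is a list made of 3 elements (year, month, week), and each of these 3 elements is a timeInterval data frame made of the same 4 variable columns (id, station, mean, max) and exactly the same number of observations (3).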
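What I want to do is to programmatically full_join, by id & station, all the timeInterval data frames that share the same name (year, month, week). This means I should end up with a new list (query1Changed) containing 3 data frames (year, month, week), each of them containing 6 columns (id, station, pluMean, pluMax, tsaMean, tsaMax) and 3 observations. Schematically, I need to arrange the data as follows: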
full_join by station and id of:
df query1$plu$year with df query1$tsa$year
df query1$plu$month with df query1$tsa$month
df query1$plu$week with df query1$tsa$week
Or, expressed with another representation:
df query1[[1]][[1]] with df query1[[2]][[1]]
df query1[[1]][[2]] with df query1[[2]][[2]]
df query1[[1]][[3]] with df query1[[2]][[3]]
And expressed programmatically (n being the total number of elements in the big list):
df query1[[i]][[1]] with df query1[[i+1]][[1]] ... with df query1[[n]][[1]]
df query1[[i]][[2]] with df query1[[i+1]][[2]] ... with df query1[[n]][[2]]
df query1[[i]][[3]] with df query1[[i+1]][[3]] ... with df query1[[n]][[3]]
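I need to achieve this programmatically because, in my real project, I could encounter another big list with more than 2 parameter elements and more than 4 variable columns in each of their timeIntervals data frames.
What will always remain the same in my analysis is that all the parameter elements of another big list will always have the same number of timeIntervals data frames, with the same names, and each of these timeIntervals data frames will always have the same number of observations and will always share two columns with exactly the same names and the same values (id & station).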
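The invariants described above can be checked before attempting any join. The following is only a minimal sketch in base R: `check_big_list` and the `toy` list are hypothetical names introduced for illustration, and the toy values are made up rather than taken from query1.

```r
# Hypothetical helper (not part of the original post): returns TRUE when a
# big list satisfies the invariants described in the question.
check_big_list <- function(big_list, keys = c("id", "station")) {
  # 1. every parameter element exposes the same timeInterval names
  interval_names <- lapply(big_list, names)
  same_names <- all(vapply(interval_names, identical, logical(1),
                           interval_names[[1]]))
  # flatten one level: a plain list of all timeInterval data frames
  all_frames <- unlist(big_list, recursive = FALSE)
  # 2. every data frame contains the key columns
  has_keys <- all(vapply(all_frames,
                         function(df) all(keys %in% names(df)), logical(1)))
  # 3. every data frame has the same number of observations
  same_rows <- length(unique(vapply(all_frames, nrow, integer(1)))) == 1
  same_names && has_keys && same_rows
}

# Toy data with made-up values, mimicking the shape of query1
toy <- list(
  plu = list(
    year  = data.frame(id = 1:3, station = 100:102, pluMean = c(0.5, 1.9, 1.1)),
    month = data.frame(id = 1:3, station = 100:102, pluMean = c(0.7, -1.4, 0.2))
  ),
  tsa = list(
    year  = data.frame(id = 1:3, station = 100:102, tsaMean = c(-1.5, -0.7, 0.1)),
    month = data.frame(id = 1:3, station = 100:102, tsaMean = c(-0.7, -1.3, -0.6))
  )
)

check_big_list(toy)
```

A check like this does not verify that the id and station values themselves match across parameters; it only guards against the structural surprises mentioned above.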
Executing the following code:
> query1Changed <- do.call(function(...) mapply(bind_cols, ..., SIMPLIFY=F), args = query1)
arranges the data as expected. However, this is not a neat solution, because we end up with duplicated columns (id & station, repeated as id1 & station1):
> str(query1Changed)
List of 3
$ year :'data.frame': 3 obs. of 8 variables:
..$ id : int [1:3] 1 2 3
..$ station : int [1:3] 100 101 102
..$ pluMean : num [1:3] 0.509 1.929 1.095
..$ pluMax : num [1:3] 0.0147 0.803 2.4817
..$ id1 : int [1:3] 1 2 3
..$ station1: int [1:3] 100 101 102
..$ tsaMean : num [1:3] -1.4906 -0.6847 0.0587
..$ tsaMax : num [1:3] 0.257 0.958 1.372
$ month:'data.frame': 3 obs. of 8 variables:
..$ id : int [1:3] 1 2 3
..$ station : int [1:3] 100 101 102
..$ pluMean : num [1:3] 0.665 -1.356 0.196
..$ pluMax : num [1:3] 0.503 0.234 -0.44
..$ id1 : int [1:3] 1 2 3
..$ station1: int [1:3] 100 101 102
..$ tsaMean : num [1:3] -0.685 -1.281 -0.6
..$ tsaMax : num [1:3] 0.963 0.531 -0.217
$ week :'data.frame': 3 obs. of 8 variables:
..$ id : int [1:3] 1 2 3
..$ station : int [1:3] 100 101 102
..$ pluMean : num [1:3] -0.608 -1.103 1.75
..$ pluMax : num [1:3] 0.97 0.924 3.475
..$ id1 : int [1:3] 1 2 3
..$ station1: int [1:3] 100 101 102
..$ tsaMean : num [1:3] 0.376 0.37 -0.105
..$ tsaMax : num [1:3] 1.938 0.812 0.745
We could add a second step to "clean" the data, but that would not be the most efficient solution, so I don't want to use this workaround.
Next, I tried to do the same thing with dplyr's full_join, but without success: the code I ran returned an error.
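So, how should I write the full_join expression so that it runs on the data frames?
Or is there another way to perform this data transformation efficiently?
I have found related questions, but I still cannot figure out how to adapt their solutions to my problem.
On Stack Overflow:
- Merging a data frame from a list of data frames [duplicate]
- Simultaneously merge multiple data.frames in a list
- Joining list of data.frames from map() call
- Combining elements of list of lists by index
On blogs:
- Joining a List of Data Frames with purrr::reduce()
Any help would be greatly appreciated. I hope I have described my problem clearly. I started programming with R 2 months ago, so please be indulgent if the solution is obvious ;)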
Answer 0 (score: 4)
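First of all, thank you for posting such a good description of what your problem is and which requirements your solution has to meet.
To start, I use purrr::map2 to create a function that takes two lists of data frames and joins them in parallel. That is, it joins the first data frame of plu with the first data frame of tsa, and so on up to the last data frame of plu with the last data frame of tsa, and returns the results as a list.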
> join_each = function(x, y) map2(x, y, full_join)
> join_each(query1$plu, query1$tsa)
Joining, by = c("id", "station")
Joining, by = c("id", "station")
Joining, by = c("id", "station")
$year
id station pluMean pluMax tsaMean tsaMax
1 1 100 0.509069 0.01469622 -1.49060722 0.2573984
2 2 101 1.928665 0.80298439 -0.68473542 0.9576348
3 3 102 1.095175 2.48170762 0.05866559 1.3719802
$month
id station pluMean pluMax tsaMean tsaMax
1 1 100 0.6649385 0.5034246 -0.6846687 0.9629169
2 2 101 -1.3559339 0.2344025 -1.2808785 0.5307734
3 3 102 0.1956006 -0.4402645 -0.6001755 -0.2171436
$week
id station pluMean pluMax tsaMean tsaMax
1 1 100 -0.6082958 0.9696683 0.3764817 1.9383364
2 2 101 -1.1025692 0.9244263 0.3704359 0.8117675
3 3 102 1.7498401 3.4746087 -0.1053549 0.7449325
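The "Joining, by" messages above appear because full_join is left to guess the key columns. As a hedged variant, the keys can be passed explicitly. The sketch below assumes dplyr and purrr are installed; `join_each_by`, its `keys` argument and the toy values are hypothetical and are not part of the answer above.

```r
library(dplyr)
library(purrr)

# Toy data with made-up values, shaped like query1$plu and query1$tsa
plu <- list(
  year  = data.frame(id = 1:3, station = 100:102, pluMean = c(0.5, 1.9, 1.1)),
  month = data.frame(id = 1:3, station = 100:102, pluMean = c(0.7, -1.4, 0.2))
)
tsa <- list(
  year  = data.frame(id = 1:3, station = 100:102, tsaMean = c(-1.5, -0.7, 0.1)),
  month = data.frame(id = 1:3, station = 100:102, tsaMean = c(-0.7, -1.3, -0.6))
)

# Hypothetical variant of join_each: the join keys are stated explicitly,
# so full_join does not have to infer them and prints no message.
join_each_by <- function(x, y, keys = c("id", "station")) {
  map2(x, y, function(a, b) full_join(a, b, by = keys))
}

joined <- join_each_by(plu, tsa)
names(joined)       # names are taken from the first list
names(joined$year)  # key columns, then the columns of each side
```

Stating the keys also protects against an accidental join on any other column that two parameters might happen to share.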
Well, that only handles two lists, but you want it to work when you have a list of n lists of data frames. For that you need purrr::reduce:
> reduce(query1, join_each)
Joining, by = c("id", "station")
Joining, by = c("id", "station")
Joining, by = c("id", "station")
$year
id station pluMean pluMax tsaMean tsaMax
1 1 100 0.509069 0.01469622 -1.49060722 0.2573984
2 2 101 1.928665 0.80298439 -0.68473542 0.9576348
3 3 102 1.095175 2.48170762 0.05866559 1.3719802
$month
id station pluMean pluMax tsaMean tsaMax
1 1 100 0.6649385 0.5034246 -0.6846687 0.9629169
2 2 101 -1.3559339 0.2344025 -1.2808785 0.5307734
3 3 102 0.1956006 -0.4402645 -0.6001755 -0.2171436
$week
id station pluMean pluMax tsaMean tsaMax
1 1 100 -0.6082958 0.9696683 0.3764817 1.9383364
2 2 101 -1.1025692 0.9244263 0.3704359 0.8117675
3 3 102 1.7498401 3.4746087 -0.1053549 0.7449325
It computes join_each(query1[[1]], query1[[2]]) %>% join_each(query1[[3]]) ... %>% join_each(query1[[n]]).
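To see the chain at work with more than two parameter elements, here is a small sketch with three of them. It assumes dplyr and purrr are installed; the third parameter `hum`, the helper `make_df` and all values are made up for illustration and do not come from the question.

```r
library(dplyr)
library(purrr)

keys <- c("id", "station")

# Hypothetical helper: builds one timeInterval data frame with a single
# measurement column named `col`.
make_df <- function(col, values) {
  df <- data.frame(id = 1:3, station = 100:102)
  df[[col]] <- values
  df
}

# Three parameter elements; "hum" is an invented third parameter
big_list <- list(
  plu = list(year = make_df("pluMean", c(0.5, 1.9, 1.1)),
             week = make_df("pluMean", c(-0.6, -1.1, 1.7))),
  tsa = list(year = make_df("tsaMean", c(-1.5, -0.7, 0.1)),
             week = make_df("tsaMean", c(0.4, 0.4, -0.1))),
  hum = list(year = make_df("humMean", c(60, 65, 70)),
             week = make_df("humMean", c(55, 58, 61)))
)

join_each <- function(x, y) {
  map2(x, y, function(a, b) full_join(a, b, by = keys))
}

# Equivalent to join_each(join_each(plu, tsa), hum)
combined <- reduce(big_list, join_each)
names(combined)       # "year" "week"
names(combined$year)  # "id" "station" "pluMean" "tsaMean" "humMean"
```

Each step of the reduction keeps the timeInterval names of the accumulated list, which is why the result is still indexed by year and week.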
Update: the following one-liner does the same thing: reduce(query1, map2, full_join). But it is not as readable.
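The one-liner works because reduce forwards its extra arguments to the function it is given, so each step becomes map2(accumulated, next_element, full_join). The sketch below compares both forms on toy data; it assumes dplyr and purrr are installed, and the values are made up.

```r
library(dplyr)
library(purrr)

# Toy big list with made-up values, shaped like query1
big_list <- list(
  plu = list(
    year = data.frame(id = 1:3, station = 100:102, pluMean = c(0.5, 1.9, 1.1)),
    week = data.frame(id = 1:3, station = 100:102, pluMean = c(-0.6, -1.1, 1.7))
  ),
  tsa = list(
    year = data.frame(id = 1:3, station = 100:102, tsaMean = c(-1.5, -0.7, 0.1)),
    week = data.frame(id = 1:3, station = 100:102, tsaMean = c(0.4, 0.4, -0.1))
  )
)

join_each <- function(x, y) map2(x, y, full_join)

# suppressMessages() only hides the "Joining" notes printed by full_join
explicit  <- suppressMessages(reduce(big_list, join_each))
one_liner <- suppressMessages(reduce(big_list, map2, full_join))

# TRUE when both forms produce the same list of data frames
same_result <- isTRUE(all.equal(explicit, one_liner))
same_result
```

The named helper costs one extra line but makes it obvious which function is applied pairwise and which one drives the reduction.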