I have panel data for a set of companies:
df <- structure(list(id = c("00127264", "00127264", "00127264", "00127264",
"00127264", "00127264", "00127264", "00127264", "00127264", "00127264",
"00127264", "00127264", "00127264", "00127264", "00127264", "00128538",
"00128538", "00128538", "00128538", "00128538", "00128538", "00128538",
"00128538", "00128538", "00128538", "00129879", "00129879", "00129879",
"00129879", "00129879", "00129879", "00129879", "00129879", "00129879",
"00129879", "00132241", "00132241", "00132241", "00132241", "00132241",
"00132241", "00132241", "00132241", "00132241", "00132241", "00132241",
"00132241", "00132241", "00132241", "00132241"), time = c(2003L,
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L,
2013L, 2014L, 2015L, 2016L, 2017L, 2008L, 2009L, 2010L, 2011L,
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2003L, 2004L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2003L, 2004L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L,
2014L, 2015L, 2016L, 2017L), sales = c(18778913, 26246705, 24577605,
20555975, 22803119, 30493587, 47409381, 39648917, 24164698, 26667934,
36939340, 37303488, 36095594, 47863204, 81470728, 17082948, 19218374,
17775729, 18719393, 17682127, 17648132, 19868021, 20034845, 20291386,
28511274, 23842198, 33364335, 38006554, 44051316, 41017519, 44559215,
38096697, 39532944, 32250063, 20456725, 36737613, 36788480, 34432314,
45703706, 51318203, 57966879, 57314960, 69108257, 83337772, 95862115,
78796350, 73897366, 122529286, 114051176, 140727472), costs = c(2776879,
6661626, 7383728, 8148280, 6965171, 15952938, 28537059, 20336344,
8049578, 8313115, 17175621, 17864169, 17323966, 25772512, 56918048,
13617240, 14974971, 13919060, 14317811, 13879155, 14374214, 14607183,
14718348, 15511957, 22142396, 21523985, 30354647, 33001065, 38699618,
35369730, 50308253, 37174212, 38743973, 28852158, 16476830, 31420842,
30050214, 28193685, 35918673, 40847638, 45944119, 44448831, 56898404,
70216220, 80454840, 63808983, 60155914, 106046623, 96525104,
119211752)), row.names = c(NA, -50L), class = c("tbl_df", "tbl",
"data.frame"))
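As you can see, it has 4 columns: id, time, sales and costs. I want to compute the correlation between sales and costs across all companies. For example, I want the correlation between the sales of the company with id 00127264 and the costs of each of the other companies ("00128538", "00129879", "00132241"). The correlation should take the time dimension into account, that is, values should be matched by year. The panel is unbalanced, so not every company is observed in every year.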
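To make the intended calculation concrete, here is a minimal base R sketch of what I mean. It matches two companies on the years they have in common and correlates the sales of one with the costs of the other. The names `pair_cor` and `all_pair_cor` are hypothetical helpers of my own, not widyr functions, and the minimum of three overlapping years is an arbitrary assumption.

```r
# Hypothetical helper: correlation between the sales of company `from`
# and the costs of company `to`, using only the years observed for both.
pair_cor <- function(data, from, to) {
  s <- data[data$id == from, c("time", "sales")]
  k <- data[data$id == to, c("time", "costs")]
  m <- merge(s, k, by = "time")  # inner join: keeps the common years only
  # Assumption: require at least 3 overlapping years, otherwise return NA.
  if (nrow(m) < 3) return(NA_real_)
  cor(m$sales, m$costs)
}

# Hypothetical helper: all ordered pairs of distinct companies,
# returned in a long from / to / cor layout.
all_pair_cor <- function(data) {
  ids <- unique(data$id)
  pairs <- expand.grid(from = ids, to = ids, stringsAsFactors = FALSE)
  pairs <- pairs[pairs$from != pairs$to, ]
  pairs <- pairs[order(pairs$from, pairs$to), ]
  pairs$cor <- mapply(function(f, t) pair_cor(data, f, t),
                      pairs$from, pairs$to, USE.NAMES = FALSE)
  rownames(pairs) <- NULL
  pairs
}

# Intended usage with the data above:
# all_pair_cor(df)
```

Because the pairs are ordered, the result is not symmetric: the correlation of sales of A with costs of B generally differs from that of sales of B with costs of A.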
I found a similar question and solution here:
Correlation matrix in panel data in R
However, the widyr package can only compute correlations for a single value variable:
widyr::pairwise_cor(sample, id, year, sales)
I need something like
widyr::pairwise_cor(sample, id, year, c(sales, costs))
which is not possible. Expected output (the correlations are just random numbers):
from to cor
127264 128538 0,54
127264 129879 0,68
127264 132241 0,78
128538 127264 0,43
128538 129879 0,48
128538 132241 0,17
129879 127264 0,57
129879 128538 0,36
129879 132241 0,89
132241 127264 0,15
132241 128538 0,6
132241 129879 0,8
Alternatively, the output could be a correlation matrix, as in the question linked above.