I have a data frame of repeated data (example below). The first column is the date on which the sample data was taken. I want to create a table that counts the number of overlaps for every possible intersection of dates. In other words, taking 2014.01.14 as an example, I want to see how many of its elements are in the 2014.07.14 sample, how many are in the 2014.04.14 sample, and so on.
I have been looking at setdiff and the calculate.overlap function (from dplyr and the VennDiagram package), but have not found an efficient way to compute these many set overlaps.
Any help is much appreciated - thanks.
Edit: the desired output is a table summarising all overlap combinations - an x by x grid, where x is the number of distinct dates and each cell is the number of overlaps for that pair of dates. The diagonal of this table would therefore be 100%, since each set is being compared with itself.
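To make the desired output concrete: a single cell of that grid is just the size of the intersection of two dates' ticker sets, e.g. in base R (assuming the dput output below is read into DF):

# One cell of the x by x grid: tickers the 2014.01.14 and 2014.04.14
# samples have in common.
a <- DF$Ticker[DF$.id == "2014.01.14"]
b <- DF$Ticker[DF$.id == "2014.04.14"]
length(intersect(a, b))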
structure(list(.id = c("2016.10.14", "2016.07.14", "2014.04.14",
"2014.01.14", "2017.01.06", "2016.10.14", "2016.07.14", "2016.04.14",
"2016.07.14", "2016.04.14", "2016.01.14", "2015.10.14", "2015.07.14",
"2015.04.14", "2015.01.14", "2014.10.14", "2017.03.06", "2017.01.06",
"2016.10.14", "2016.07.14", "2016.04.14", "2016.01.14", "2015.04.14",
"2015.01.14", "2014.10.14", "2014.07.14", "2014.04.14", "2014.01.14",
"2016.07.14", "2016.04.14", "2016.01.14", "2015.10.14", "2015.07.14",
"2015.04.14", "2015.04.14", "2016.01.14", "2015.10.14", "2015.07.14",
"2015.04.14", "2015.01.14", "2014.10.14", "2014.07.14", "2014.04.14",
"2014.01.14", "2017.03.06", "2017.01.06", "2016.10.14", "2016.07.14",
"2015.01.14", "2014.10.14", "2014.07.14", "2014.04.14", "2014.01.14",
"2016.04.14", "2016.01.14", "2017.03.06", "2017.01.06", "2016.10.14",
"2016.07.14", "2016.04.14", "2016.01.14", "2015.10.14", "2015.07.14",
"2015.04.14", "2015.01.14", "2014.10.14", "2014.07.14", "2014.04.14",
"2014.01.14", "2017.03.06", "2017.01.06", "2015.10.14", "2015.07.14",
"2017.03.06", "2017.01.06", "2015.10.14", "2015.07.14", "2015.04.14",
"2015.01.14", "2014.10.14", "2014.07.14", "2014.04.14", "2014.01.14",
"2016.01.14", "2015.10.14", "2015.07.14", "2015.04.14", "2015.01.14",
"2014.10.14", "2015.07.14", "2017.03.06", "2017.01.06", "2016.10.14",
"2016.07.14", "2016.04.14", "2016.01.14", "2015.10.14", "2015.07.14",
"2015.04.14", "2015.01.14"), Name = c("ZIMMER BIOMET HOLDINGS INC",
"ZIMMER BIOMET HOLDINGS INC", "ZIMMER BIOMET HOLDINGS INC", "ZIMMER BIOMET HOLDINGS INC",
"XYLEM INC", "XYLEM INC", "XYLEM INC", "XYLEM INC", "XL GROUP LTD",
"XL GROUP LTD", "XL GROUP LTD", "XL GROUP LTD", "XL GROUP LTD",
"XL GROUP LTD", "XL GROUP LTD", "XL GROUP LTD", "XCEL ENERGY INC",
"XCEL ENERGY INC", "XCEL ENERGY INC", "XCEL ENERGY INC", "XCEL ENERGY INC",
"XCEL ENERGY INC", "XCEL ENERGY INC", "XCEL ENERGY INC", "XCEL ENERGY INC",
"XCEL ENERGY INC", "XCEL ENERGY INC", "XCEL ENERGY INC", "WILLIS TOWERS WATSON PLC",
"WILLIS TOWERS WATSON PLC", "WEYERHAEUSER CO", "WEYERHAEUSER CO",
"WEYERHAEUSER CO", "WEYERHAEUSER CO", "WELLTOWER INC", "WELLS FARGO & CO",
"WELLS FARGO & CO", "WELLS FARGO & CO", "WELLS FARGO & CO", "WELLS FARGO & CO",
"WELLS FARGO & CO", "WELLS FARGO & CO", "WELLS FARGO & CO", "WELLS FARGO & CO",
"WEC ENERGY GROUP INC", "WEC ENERGY GROUP INC", "WEC ENERGY GROUP INC",
"WEC ENERGY GROUP INC", "WEC ENERGY GROUP INC", "WEC ENERGY GROUP INC",
"WEC ENERGY GROUP INC", "WEC ENERGY GROUP INC", "WEC ENERGY GROUP INC",
"WATERS CORP", "WATERS CORP", "WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC",
"WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC",
"WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC",
"WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC",
"WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC", "WASTE MANAGEMENT INC",
"WALT DISNEY CO/THE", "WALT DISNEY CO/THE", "WALT DISNEY CO/THE",
"WALT DISNEY CO/THE", "WAL-MART STORES INC", "WAL-MART STORES INC",
"WAL-MART STORES INC", "WAL-MART STORES INC", "WAL-MART STORES INC",
"WAL-MART STORES INC", "WAL-MART STORES INC", "WAL-MART STORES INC",
"WAL-MART STORES INC", "WAL-MART STORES INC", "VORNADO REALTY TRUST",
"VORNADO REALTY TRUST", "VORNADO REALTY TRUST", "VORNADO REALTY TRUST",
"VORNADO REALTY TRUST", "VORNADO REALTY TRUST", "VF CORP", "VERIZON COMMUNICATIONS INC",
"VERIZON COMMUNICATIONS INC", "VERIZON COMMUNICATIONS INC", "VERIZON COMMUNICATIONS INC",
"VERIZON COMMUNICATIONS INC", "VERIZON COMMUNICATIONS INC", "VERIZON COMMUNICATIONS INC",
"VERIZON COMMUNICATIONS INC", "VERIZON COMMUNICATIONS INC", "VERIZON COMMUNICATIONS INC"
), X..Wgt..P. = c(0.919761358, 0.933057358, 0.864039801, 0.986472727,
0.818192145, 0.964415732, 0.937955303, 0.975205821, 0.842865506,
1.126873419, 1.112947401, 1.099063977, 1.164558843, 1.151605792,
0.988825771, 0.918127246, 0.961678801, 1.033232496, 1.071266028,
1.058801025, 0.93020876, 1.006470014, 0.964550593, 1.125389358,
1.124178769, 1.059881818, 1.085759766, 1.0200896, 0.909300947,
1.012656226, 0.871314497, 0.94065386, 0.936539398, 0.89578771,
0.970174408, 0.872847521, 0.949033595, 1.002482323, 0.969762291,
0.942347801, 0.989303579, 0.98593113, 0.968543074, 0.950396767,
0.92842183, 0.972461108, 1.006570516, 1.005416119, 0.993368064,
1.002700203, 0.958855231, 1.096422022, 1.021761901, 0.963388913,
0.926948746, 1.304745321, 1.199632088, 1.234399398, 1.269585556,
1.087885176, 1.189337133, 1.146522298, 1.053416881, 1.16547938,
1.096877887, 1.018844619, 1.014660123, 0.999372549, 1.032241197,
1.063327494, 1.00757917, 0.963540352, 0.944914081, 0.896543743,
0.80520587, 0.841142125, 0.921557168, 0.959306858, 1.278648475,
1.299562929, 1.191759169, 1.222574274, 1.082399965, 0.943414727,
0.998604251, 0.94817809, 1.097301363, 1.220343667, 0.931967545,
0.900281393, 1.018445806, 1.18929386, 1.085480841, 1.13721209,
1.070562566, 1.135912005, 1.0973153, 1.025256697, 1.008197447,
0.907933413), Mkt.Val..P. = c(60495554.15, 74208256.58, 33088060.26,
36688942.4, 52333970.68, 63432610.68, 74597801.67, 66667061.11,
67035085.44, 77035367.75, 61854260.94, 52447251.15, 55756621.76,
59977265.12, 52416987.1, 41839037.8, 62294332.68, 66088582.56,
70460485.65, 84208947.55, 63590970.16, 55936568.82, 50235164.73,
59656130.73, 51228801, 43818397.08, 41578738.06, 37939222.8,
72318853.08, 69227247.24, 48425032.67, 44887932.18, 44839531.56,
46653896.1, 50528060.96, 48510233.52, 45287812.5, 47996739.75,
50506597.32, 49953221.25, 45082541.65, 40761074.48, 37089971.51,
35347203.52, 60140057.4, 62201466.24, 66205261.4, 79963119.81,
52657770.96, 45693025.53, 39641683.2, 41987044.93, 38001419.1,
65859233.16, 51517016.5, 84517033.09, 76731988.76, 81190272.8,
100973139.4, 74370051.84, 66099771.92, 54711958.7, 50435379,
60699821.28, 58144756.89, 46428726.18, 41948809.2, 38270574,
38391165.6, 68878794.6, 64447720.56, 45980073.86, 45240493.7,
58075101.72, 51503330.42, 40139239.59, 44122213.97, 49962063.55,
67780293.12, 59221053.22, 49270565.45, 46817995.2, 40256672.96,
52432146.06, 47653320.52, 45396767.6, 57149013.33, 64689594.6,
42469739.92, 43103574.75, 65971509.17, 76070725.4, 71395437.96,
90445164.8, 73185842.88, 63130564.42, 52363804.43, 49087128.75,
52508354.86, 48128937.78), Ticker = c("ZBH US", "ZBH US", "ZBH US",
"ZBH US", "XYL US", "XYL US", "XYL US", "XYL US", "XL US", "XL US",
"XL US", "XL US", "XL US", "XL US", "XL US", "XL US", "XEL US",
"XEL US", "XEL US", "XEL US", "XEL US", "XEL US", "XEL US", "XEL US",
"XEL US", "XEL US", "XEL US", "XEL US", "WLTW US", "WLTW US",
"WY US", "WY US", "WY US", "WY US", "HCN US", "WFC US", "WFC US",
"WFC US", "WFC US", "WFC US", "WFC US", "WFC US", "WFC US", "WFC US",
"WEC US", "WEC US", "WEC US", "WEC US", "WEC US", "WEC US", "WEC US",
"WEC US", "WEC US", "WAT US", "WAT US", "WM US", "WM US", "WM US",
"WM US", "WM US", "WM US", "WM US", "WM US", "WM US", "WM US",
"WM US", "WM US", "WM US", "WM US", "DIS US", "DIS US", "DIS US",
"DIS US", "WMT US", "WMT US", "WMT US", "WMT US", "WMT US", "WMT US",
"WMT US", "WMT US", "WMT US", "WMT US", "VNO US", "VNO US", "VNO US",
"VNO US", "VNO US", "VNO US", "VFC US", "VZ US", "VZ US", "VZ US",
"VZ US", "VZ US", "VZ US", "VZ US", "VZ US", "VZ US", "VZ US"
)), .Names = c(".id", "Name", "X..Wgt..P.", "Mkt.Val..P.", "Ticker"
), row.names = c(NA, 100L), class = "data.frame")
Answer 0 (score: 2)
Although Frank posted a solution in a comment, this question has not received an answer yet. I am therefore posting two solutions that implement Frank's approach: one using expand() / spread() from dplyr / tidyr, and one using a cross join (CJ()) / dcast() from data.table.

In all three approaches (Frank's and the two below), the Ticker column is expanded for each .id, the number of occurrences is then counted for every unique pairwise combination of .id values, and the result is finally reshaped from long to wide format.

dplyr / tidyr
library(dplyr)
library(tidyr)

DF %>%
  group_by(Ticker) %>%
  expand(.id, .id) %>%        # all ordered pairs of dates per ticker: columns .id and .id1
  group_by(.id, .id1) %>%
  count() %>%                 # number of shared tickers per pair of dates
  spread(.id1, n)             # reshape long to wide
# A tibble: 14 x 15
# Groups: .id [14]
.id `2014.01.14` `2014.04.14` `2014.07.14` `2014.10.14` `2015.01.14` `2015.04.14` `2015.07.14` `2015.10.14`
* <chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 2014.01.14 6 6 5 5 5 4 3 3
2 2014.04.14 6 6 5 5 5 4 3 3
3 2014.07.14 5 5 5 5 5 4 3 3
4 2014.10.14 5 5 5 7 7 6 5 5
5 2015.01.14 5 5 5 7 8 7 6 6
6 2015.04.14 4 4 4 6 7 9 7 7
7 2015.07.14 3 3 3 5 6 7 9 8
8 2015.10.14 3 3 3 5 6 7 8 8
9 2016.01.14 3 3 3 5 6 7 6 6
10 2016.04.14 2 2 2 3 4 4 3 3
11 2016.07.14 4 4 3 4 5 4 3 3
12 2016.10.14 4 4 3 3 4 3 2 2
13 2017.01.06 4 4 4 4 5 4 4 4
14 2017.03.06 4 4 4 4 5 4 4 4
# ... with 6 more variables: `2016.01.14` <int>, `2016.04.14` <int>, `2016.07.14` <int>, `2016.10.14` <int>,
# `2017.01.06` <int>, `2017.03.06` <int>
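If percentages are wanted (the question notes the diagonal should read 100%), one possible follow-up - assuming the wide result above has been stored in a variable wide (a hypothetical name), with rows and columns in the same sorted date order as printed - is:

wide <- ungroup(wide)                     # drop the grouping left over from the pipeline
m <- as.matrix(wide[, -1])                # keep only the count columns
rownames(m) <- wide$.id
m[is.na(m)] <- 0                          # date pairs sharing no tickers, if any
pct <- sweep(m, 1, diag(m), "/") * 100    # divide row i by the size of set i
round(pct, 1)                             # diagonal is now 100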
data.table
library(data.table)

setDT(DF)[, CJ(.id, .id), by = Ticker                        # cross join: all ordered pairs of dates per ticker
  ][, dcast(.SD, V1 ~ V2, length, value.var = "Ticker")]     # count rows per pair and reshape to wide
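To see why the expand-then-count idea behind both solutions yields the overlap grid, here is a small base-R sketch on hypothetical toy data (not the posted DF): expanding the dates per ticker enumerates every ordered pair of dates on which that ticker appears, so tabulating those pairs counts the tickers shared by each pair of dates.

# Hypothetical toy data: "AAA" is held on two dates, "BBB" on all three.
toy <- data.frame(.id    = c("d1", "d2", "d1", "d2", "d3"),
                  Ticker = c("AAA", "AAA", "BBB", "BBB", "BBB"),
                  stringsAsFactors = FALSE)

# Per ticker, list every ordered pair of dates it appears on ...
pairs <- do.call(rbind, lapply(split(toy$.id, toy$Ticker),
                               function(d) expand.grid(from = d, to = d,
                                                       stringsAsFactors = FALSE)))

# ... and tabulate: each cell is the number of tickers shared by the two dates.
table(pairs$from, pairs$to)
#      d1 d2 d3
#   d1  2  2  1
#   d2  2  2  1
#   d3  1  1  1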