Drop rows from the intersection of two columns based on a condition in a third column

Time: 2018-03-04 16:14:04

Tags: r

I have the following data:

df <- structure(list(IDVar = 1:40, Major.sectors = structure(c(5L, 
                                                                   9L, 3L, 15L, 11L, 7L, 18L, 18L, 18L, 3L, 3L, 3L, 3L, 17L, 3L, 
                                                                   11L, 7L, 17L, 3L, 11L, 3L, 18L, 3L, 17L, 9L, 18L, 9L, 19L, 3L, 
                                                                   11L, 11L, 2L, 5L, 3L, 18L, 17L, 4L, 2L, 3L, 3L), .Label = c("Banks", 
                                                                                                                               "Chemicals, rubber, plastics, non-metallic products", "Construction", 
                                                                                                                               "Education, Health", "Food, beverages, tobacco", "Gas, Water, Electricity", 
                                                                                                                               "Hotels & restaurants", "Insurance companies", "Machinery, equipment, furniture, recycling", 
                                                                                                                               "Metals & metal products", "Other services", "Post & telecommunications", 
                                                                                                                               "Primary sector", "Public administration & defense", "Publishing, printing", 
                                                                                                                               "Textiles, wearing apparel, leather", "Transport", "Wholesale & retail trade", 
                                                                                                                               "Wood, cork, paper"), class = "factor"), Region.in.country = structure(c(15L, 
                                                                                                                                                                                                        8L, 8L, 8L, 10L, 15L, 19L, 10L, 8L, 10L, 3L, 18L, 4L, 12L, 4L, 
                                                                                                                                                                                                        15L, 13L, 4L, 15L, 15L, 7L, 15L, 12L, 1L, 7L, 10L, 15L, 8L, 13L, 
                                                                                                                                                                                                        15L, 12L, 8L, 7L, 15L, 15L, 10L, 8L, 10L, 10L, 15L), .Label = c("Andalucia", 
                                                                                                                                                                                                                                                                        "Aragon", "Asturias", "Canary Islands", "Cantabria", "Castilla-La Mancha", 
                                                                                                                                                                                                                                                                        "Castilla y Leon", "Cataluna", "Ceuta", "Comunidad Valenciana", 
                                                                                                                                                                                                                                                                        "Extremadura", "Galicia", "Islas Baleares", "La Rioja", "Madrid", 
                                                                                                                                                                                                                                                                        "Melilla", "Murcia", "Navarra", "Pais Vasco"), class = "factor"), 
                         EBIT.TA = c(-0.234432635519391, -0.884337466274593, -0.00446559204081373, 
                                     0.11109107677028, -0.137203773525798, -0.582114677880617, 
                                     0.0190497663203189, -3.04252763094666, 0.113157822682219, 
                                     -0.0255533180037229, 0.281767142199724, 0.0326641697396841, 
                                     -0.00879974750993553, 0.0542074697816672, -0.112104697294392, 
                                     -0.191945591325174, -0.00380586115226597, -0.0363239884169068, 
                                     -0.273949107908537, 0.435398668004486, -0.00563436099927988, 
                                     -2.75971618056051, -0.1047327709263, 0.151283793741506, -0.0373197549569126, 
                                     0.00912639083178201, -0.0386627754065697, -0.018235399636112, 
                                     -0.0118104711362467, -0.701299939137125, NA, 0.0191819361175666, 
                                     -0.0104887983706721, -0.801677105519484, -0.402194475974272, 
                                     -0.124125227730062, 0.143020458476649, -0.601186271451194, 
                                     0.0163269364787831, 5.09955167591238), EBIT.TA_l1 = c(-0.443687074746458, 
                                                                                           -0.561864166134075, -0.0345769510044604, 0.0282541797531804, 
                                                                                           -0.0181173929170762, 0.0147211350970115, 0.0588534950162799, 
                                                                                           -1.14097109926961, 0.060100343733096, -0.0386426338471025, 
                                                                                           0.049684095221329, 0.0558174150334904, 0.00214962169435867, 
                                                                                           0.0399960114646072, 0.0402934579830171, -0.612359147433149, 
                                                                                           -0.0115916125659674, 0.00739473610413031, 0.0174576615247567, 
                                                                                           0.68624861825246, 0.0305807338940829, -3.88006243913616, 
                                                                                           0.0410122725022661, -0.089491343996377, -0.215219123182103, 
                                                                                           0.00967853324842811, -0.0336715197882038, 0.362424791356667, 
                                                                                           0.221203934329637, -0.654387857513823, 0.0656934439915892, 
                                                                                           0.0652005453654772, 0.0339559014267185, 0.0259085077216708, 
                                                                                           -0.303606048856146, 0.0280113794301873, 0.109307291990628, 
                                                                                           -0.470048555841697, -0.00157699300508027, -0.350519090107081
                                     ), EBIT.TA_l2 = c(-0.351308186716873, 0.00159428805074234, 
                                                       -0.00604587147802615, 0.0761894448922952, -0.00348378141492824, 
                                                       NA, 0.0346370866793768, -0.552226781084599, 0.00220031803369861, 
                                                       -0.0285840972149053, 0.065316579236306, 0.4090851643341, 
                                                       -0.0188362202518351, 0.0403848986306371, 0.091146090480032, 
                                                       -0.0154168449752466, -0.0694803621032671, 0.0511978643139393, 
                                                       -0.452924037757731, -0.0091835704914724, 0.0119918914092344, 
                                                       0.0858960833880717, NA, 0.104901526886479, -0.23096183545392, 
                                                       -0.0163058345980967, 0.100643431561465, 0.0527859573541712, 
                                                       0.250207316117438, NA, 0.00193240515291123, 0.0624210741756767, 
                                                       0.0178136227732972, -0.0321294913646274, -0.0699629484084657, 
                                                       -0.00417176180400133, 0.209612573099415, 0.0285645570852926, 
                                                       0.0551624216079071, 0.0172738293439595), Major.sectors.id = c(1L, 
                                                                                                                     2L, 3L, 4L, 5L, 6L, 7L, 7L, 7L, 3L, 3L, 3L, 3L, 8L, 3L, 5L, 
                                                                                                                     6L, 8L, 3L, 5L, 3L, 7L, 3L, 8L, 2L, 7L, 2L, 9L, 3L, 5L, 5L, 
                                                                                                                     10L, 1L, 3L, 7L, 8L, 11L, 10L, 3L, 3L), Region.in.country.id = c(1L, 
                                                                                                                                                                                      2L, 2L, 2L, 3L, 1L, 4L, 3L, 2L, 3L, 5L, 6L, 7L, 8L, 7L, 1L, 
                                                                                                                                                                                      9L, 7L, 1L, 1L, 10L, 1L, 8L, 11L, 10L, 3L, 1L, 2L, 9L, 1L, 
                                                                                                                                                                                      8L, 2L, 10L, 1L, 1L, 3L, 2L, 3L, 3L, 1L)), .Names = c("IDVar", 
                                                                                                                                                                                                                                            "Major.sectors", "Region.in.country", "EBIT.TA", "EBIT.TA_l1", 
                                                                                                                                                                                                                                            "EBIT.TA_l2", "Major.sectors.id", "Region.in.country.id"), row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                     40L), class = "data.frame")

I randomly generate a column of zeros and ones for illustration.

x <- 40
df$x<- sample(c(0,1), replace=TRUE, size=x)

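What I want to do is drop the rows that have zero values, based on a few conditions.

If df$x == 1
  if intersect(region.id, sector.id) == 0  # i.e. no data
    then drop

So I want to group_by region and sector, and if the intersection between the two columns does not exist, drop that observation.

Consider the following image. I basically want to drop the intersections of the columns that have no data. So, say sector.id: 1, region.id: 5 has no data, so I want to drop it. (However, my data is not grouped like the image below; it comes as the dput code above.)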

[Image not reproduced]
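To see which sector/region combinations hold no observations at all, a cross-tabulation can help. The following is a minimal sketch, assuming a made-up toy data frame (`toy`) that only borrows the id column names from the question; the values are illustrative and not taken from the real data.

```r
# Toy data (hypothetical values) with the question's id column names
toy <- data.frame(
  Major.sectors.id     = c(1L, 1L, 2L, 3L),
  Region.in.country.id = c(1L, 2L, 2L, 1L)
)

# Count observations for every sector/region combination
counts <- table(sector = toy$Major.sectors.id,
                region = toy$Region.in.country.id)
print(counts)

# List the combinations that have no observations
empty <- as.data.frame(counts, stringsAsFactors = FALSE)
empty <- empty[empty$Freq == 0, c("sector", "region")]
print(empty)
```

A zero cell in `counts` corresponds to an "intersection with no data" in the sense described above.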

1 Answer:

Answer 0: (score: 0)

In the example I used NA in x to mark the missing values.

# get ready
set.seed(123) # set seed for reproducibility
df$x <- sample(c(NA,1), 40, replace = TRUE) # sample values

Base solution

# split by ids, check for values, bind together nonempty combinations
dfs_split <- split(df, list(df$Major.sectors.id, df$Region.in.country.id))
has_value <- sapply(dfs_split, function(df) !all(is.na(df$x)))
dfs_nonempty <- dfs_split[has_value]
res <- do.call(rbind, dfs_nonempty)

Explanation:

  • split divides the data into the groups you specify
  • sapply applies a test for non-missing values to each group
  • do.call helps rbind the groups (which actually form a list)
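The steps above can be traced on a small case. This is a sketch only, assuming a made-up toy data frame (`toy`) with the question's column names; the names `groups` and `g` are illustrative.

```r
# Toy data (hypothetical values): x is NA where a row carries no value
toy <- data.frame(
  Major.sectors.id     = c(1L, 1L, 2L, 2L, 3L),
  Region.in.country.id = c(1L, 1L, 1L, 2L, 2L),
  x                    = c(NA, 1, NA, NA, 1)
)

# One list element per sector/region combination;
# drop = TRUE skips combinations that never occur in the data
groups <- split(toy,
                list(toy$Major.sectors.id, toy$Region.in.country.id),
                drop = TRUE)

# TRUE for groups that hold at least one non-missing x
has_value <- sapply(groups, function(g) !all(is.na(g$x)))

# Stack the surviving groups back into one data frame
res <- do.call(rbind, groups[has_value])
print(res)
```

Here the groups for sector 2 contain only NA in x and are dropped, while both rows of sector 1 / region 1 are kept because one of them has a value. If the group-derived row names are unwanted, `rownames(res) <- NULL` resets them.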

dplyr solution

This is the cleaner option.
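A minimal dplyr sketch of the same filter, assuming the dplyr package is installed and using a made-up toy data frame (`toy`) with the question's column names. It is one way to express the idea and not necessarily the code the answer originally showed.

```r
library(dplyr)

# Toy data (hypothetical values): x is NA where a row carries no value
toy <- data.frame(
  Major.sectors.id     = c(1L, 1L, 2L, 2L, 3L),
  Region.in.country.id = c(1L, 1L, 1L, 2L, 2L),
  x                    = c(NA, 1, NA, NA, 1)
)

res <- toy %>%
  group_by(Major.sectors.id, Region.in.country.id) %>%
  filter(!all(is.na(x))) %>%  # keep groups with at least one non-missing x
  ungroup()

print(res)
```

The condition inside `filter()` is evaluated once per group, so a group is kept or dropped as a whole.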
