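Making the treatment of NA values depend on their number relative to the number of available values within a group of a data frame, in R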

Date: 2014-11-13 10:18:00

Tags: r dplyr

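I have a dataset of treaties between states. The number of contracting states ranges from 2 to 94. In another data frame, each state is assigned a value called "polity", although for some states this value is missing.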

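With the help of this forum, I merged the two data frames and then summarised each treaty by taking the difference between the max() and min() "polity" values of its contracting states.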

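Now, I don't want to simply ignore or exclude the NA values. If the number of NA values among a treaty's contracting states exceeds a certain share of the number of contracting states, I want the treaty's polity value to be treated as NA. (For these data frames, it is most convenient to say that 4/5 of the states must provide a polity value for the treaty to be used in the analysis.)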

Here are simplified versions of my two datasets:

treaties <- data.frame(treaty.ID=c(1,1,2,2,3,3,3,4,4,4,4,4),
                   Treaty=c("hungary slovenia 1994", "hungary slovenia 1994",
                            "taiwan hungary 1994", "taiwan hungary 1994", 
                            "Treaty of Izmir 1977", "Treaty of Izmir 1977",
                            "Treaty of Izmir 1977", "Treaty of Five 1909", 
                            "Treaty of Five 1909", "Treaty of Five 1909",
                            "Treaty of Five 1909","Treaty of Five 1909"),
                   scode=c("HUN","SLV","TAW","HUN", "IRN", "TUR", "PAK", 
                           "AUS","AUL","NEW","USA","CAN"),
                   year=c(1994, 1994, 1994, 1994, 1977, 1977, 1977, 1909, 
                          1909, 1909, 1909, 1909),
                   pr.dem=c(1,1,0,0,0,0,0,1,1,1,1,1))

POL <- data.frame(country=c("Hungary", "Slovenia", "Taiwan","Austria",
                           "Australia", "New Zealand", "USA", "Canada",
                           "Iran","Turkey", "Pakistan"),
                 scode=c("HUN", "SLV", "TAW", "AUS", "AUL", "NEW", "USA",
                         "CAN", "IRN", "TUR", "PAK"),
                 year=c(1994, 1994, 1994, 1909, 1909, 1909, 1909, 1909,
                        1977, 1977, 1977),
                 polity = c(7, NA, 9, 8, 8, 10, 10, NA, -10, 9, NA))

(So only treaties 1 and 3 should end up with an NA "polity" value.)

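I joined them and collapsed the multiple rows belonging to the same treaty into one, taking the difference between the maximum and minimum polity values: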

library(dplyr)
left_join(treaties, POL, by = c("scode", "year")) %>%
  group_by(Treaty) %>%
  summarise(PolityDiff = max(polity) - min(polity))

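I would like to know whether the treatment of NA values can be made to depend on their number relative to the number of available values within each group of the grouped data frame.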

I tried to include an ifelse function:

DIFF <- left_join(treaties, Polity, c("scode","year")) %>%
                       group_by(DIFF, File)

summarise(DIFF, polity.Diff=max(polity, na.rm = ifelse(length(polity = NA) >= 0.2*length(polity), TRUE, FALSE))-
            min(polity, na.rm = ifelse(length(polity = NA) >= 0.2*length(polity), TRUE, FALSE)))

But it returns an error:

Error: index out of bounds

Can I use an ifelse() function after "na.rm ="? Did I make a mistake? I would greatly appreciate your help.

1 answer:

Answer 0: (score: 1)

This should do what you want:

left_join(treaties, POL, c("scode","year")) %>%
  group_by(Treaty) %>%
  summarise(polity.Diff = max(polity, na.rm = sum(is.na(polity)) >= 0.2*n()) -
                          min(polity, na.rm = sum(is.na(polity)) >= 0.2*n()))
#Source: local data frame [4 x 2]
#
#                 Treaty polity.Diff
#1 hungary slovenia 1994           0
#2   taiwan hungary 1994           2
#3   Treaty of Five 1909           2
#4  Treaty of Izmir 1977          19

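First, I used sum(is.na(polity)) instead of length(polity = NA); second, I used dplyr's special function n() instead of length(polity); third, I removed the ifelse and kept only the logical test, which already returns TRUE or FALSE as required. Note that in 3 of the cases the NAs are removed, and in one case (taiwan hungary 1994) they are not removed because that group contains no NAs at all. That is why you end up without any NA in the polity.Diff column.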
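Note that the test above sets na.rm = TRUE exactly when at least a fifth of a group is missing, which is why no treaty comes out as NA. If the goal is the rule stated in the question (a treaty counts only when at least 4/5 of its states have a polity value, and is NA otherwise), the test would need to point the other way. Here is a minimal sketch under that reading, using compact copies of the question's data. The helper column name `enough` is my own choice for illustration and does not appear in the original code:

```r
library(dplyr)

# Compact copies of the question's data (only the columns the join needs).
treaties <- data.frame(
  Treaty = rep(c("hungary slovenia 1994", "taiwan hungary 1994",
                 "Treaty of Izmir 1977", "Treaty of Five 1909"),
               times = c(2, 2, 3, 5)),
  scode = c("HUN", "SLV", "TAW", "HUN", "IRN", "TUR", "PAK",
            "AUS", "AUL", "NEW", "USA", "CAN"),
  year = rep(c(1994, 1994, 1977, 1909), times = c(2, 2, 3, 5)),
  stringsAsFactors = FALSE
)
POL <- data.frame(
  scode = c("HUN", "SLV", "TAW", "AUS", "AUL", "NEW", "USA",
            "CAN", "IRN", "TUR", "PAK"),
  year = c(1994, 1994, 1994, 1909, 1909, 1909, 1909, 1909,
           1977, 1977, 1977),
  polity = c(7, NA, 9, 8, 8, 10, 10, NA, -10, 9, NA),
  stringsAsFactors = FALSE
)

strict <- left_join(treaties, POL, by = c("scode", "year")) %>%
  group_by(Treaty) %>%
  summarise(
    # TRUE only when at least 4 out of 5 states have a polity value;
    # integer arithmetic avoids comparing against a floating-point 0.2
    enough = 5 * sum(!is.na(polity)) >= 4 * n(),
    # with na.rm = FALSE, any NA in the group makes max()/min() return NA
    polity.Diff = max(polity, na.rm = enough) - min(polity, na.rm = enough)
  ) %>%
  select(-enough)

print(strict)
```

With this data, treaties 1 and 3 (hungary slovenia 1994, Treaty of Izmir 1977) come out as NA, while the other two both give 2, which matches what the question asked for.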

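You may notice that you perform the same logical test for both max and min. This can be handled more efficiently by first creating a new variable in your data, e.g. NAcheck, and then referring only to that variable in the na.rm = argument. However, you would then also need to remove that variable afterwards (e.g. with select(-NAcheck)).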
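The NAcheck idea could look roughly like the sketch below. To keep it short, it assumes a data frame `joined` that stands in for the result of the left_join() (one row per treaty member, with the polity values from the question). That name and the use of first() to pick a single value per group are my assumptions, not part of the original code:

```r
library(dplyr)

# Hypothetical stand-in for the joined data: one row per treaty member.
joined <- data.frame(
  Treaty = rep(c("hungary slovenia 1994", "taiwan hungary 1994",
                 "Treaty of Izmir 1977", "Treaty of Five 1909"),
               times = c(2, 2, 3, 5)),
  polity = c(7, NA, 9, 7, -10, 9, NA, 8, 8, 10, 10, NA),
  stringsAsFactors = FALSE
)

result <- joined %>%
  group_by(Treaty) %>%
  # evaluate the test once per group and store it as a column
  mutate(NAcheck = sum(is.na(polity)) >= 0.2 * n()) %>%
  # NAcheck is constant within a group, so its first element is enough
  summarise(polity.Diff = max(polity, na.rm = first(NAcheck)) -
                          min(polity, na.rm = first(NAcheck)))

print(result)
```

Because summarise() returns only the grouping column and the newly computed summary, NAcheck does not appear in `result` here. The select(-NAcheck) step would matter only if the column were kept on the unsummarised data.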