Question

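I want to identify groups that contain an indicator. In the example below, I want to identify the districts that contain county == 'other'. If any county == 'other' within a district, then I want the indicator variable to be 1, otherwise 0, for every row in that district. Below are multiple attempts using any, split and lapply, none of which work. Perhaps I could extract all rows with county == 'other', define the indicator as one for that subset, and then merge that subset with the original data set, but I keep thinking there must be a simpler way. Thank you for any suggestions.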


Edit

Here is the subset/merge approach I mentioned above:
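The code for this edit does not appear in the text, so what follows is only a minimal base R sketch of what a subset/merge approach could look like, not the poster's own code. The names flagged and df.2 are made up for illustration, and the data set is the df.1 from the question, repeated so the block runs on its own.

```r
df.1 <- read.table(text = '
    state    district    county    apples
       AA          EC        AB       100
       AA          EC        BC        10
       AA          EC        DC       150
       AA           C        FG       200
       AA           C     other        20
       AA           C        HC       250
       AA          WC        RT       300
       AA          WC        TT        30
       AA          WC     other       350
', header = TRUE, stringsAsFactors = FALSE)

# Step 1: subset to the districts that have at least one 'other' county
# and mark them with indicator = 1 (flagged is a hypothetical helper name).
flagged <- unique(df.1[df.1$county == 'other', c('state', 'district')])
flagged$indicator <- 1

# Step 2: merge the flags back onto every row; all.x = TRUE keeps the
# districts without a match, which come back with indicator = NA.
df.2 <- merge(df.1, flagged, by = c('state', 'district'), all.x = TRUE)

# Step 3: unmatched districts get indicator = 0.
df.2$indicator[is.na(df.2$indicator)] <- 0
```

Note that merge sorts the result by the key columns by default, so the rows of df.2 will not be in the same order as df.1.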

df.1 <- read.table(text = '

    state    district    county    apples
       AA          EC        AB       100
       AA          EC        BC        10
       AA          EC        DC       150
       AA           C        FG       200
       AA           C     other        20
       AA           C        HC       250
       AA          WC        RT       300
       AA          WC        TT        30
       AA          WC     other       350

', header=TRUE, stringsAsFactors = FALSE)

desired.result <- read.table(text = '

    state    district    county    apples  indicator
       AA          EC        AB       100          0
       AA          EC        BC        10          0
       AA          EC        DC       150          0
       AA           C        FG       200          1
       AA           C     other        20          1
       AA           C        HC       250          1
       AA          WC        RT       300          1
       AA          WC        TT        30          1
       AA          WC     other       350          1

', header=TRUE, stringsAsFactors = FALSE)

# various attempts that do not work

with(df.1, lapply(split(county, district), function(x) {any(x)=='county' <- 1} ))
with(df.1, lapply(split(county, district), function(x) {ifelse(any(x)=='other', 1, 0)} ))
with(df.1, lapply(split(county, district), function(x) {any(x)=='other'} ))
with(df.1, lapply(split(df.1  , district), function(x) {any(x$county)=='other'} ))
with(df.1, lapply(split(county, district), function(x) {x=='other'} ))

I prefer to use base R.

Answer 1

library(data.table)

dt = data.table(df.1)
dt[, indicator := 1*any(county == 'other'), by = district]

dt
#   state district county apples indicator
#1:    AA       EC     AB    100         0
#2:    AA       EC     BC     10         0
#3:    AA       EC     DC    150         0
#4:    AA        C     FG    200         1
#5:    AA        C  other     20         1
#6:    AA        C     HC    250         1
#7:    AA       WC     RT    300         1
#8:    AA       WC     TT     30         1
#9:    AA       WC  other    350         1

Here is a base R solution. It is much slower and far uglier, but if that is what the OP prefers, so be it :).

df.1$indicator = as.numeric(ave(df.1$county, df.1$district,
                                FUN = function(x) {1*any(x == "other")}))

Or

df.1$indicator <- with(df.1, ave(county=='other', district, FUN=max))

Or

df.1$indicator <- with(df.1, ave(county=='other', district, FUN=any)+0L)
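To see why the ave variants above work, it may help to run ave on a tiny vector: it applies FUN within each group and returns a vector as long as its input, so the group-level answer is repeated for every row of the group. A minimal sketch, where county and district are made-up illustration vectors rather than the columns of df.1:

```r
# Made-up illustration data (not the df.1 from the question).
county   <- c('AB', 'other', 'HC', 'RT')
district <- c('EC', 'C', 'C', 'WC')

# any() is evaluated once per district; ave() writes that single result
# back to every position belonging to the district.
flag <- ave(county == 'other', district, FUN = any)

# Adding 0L turns the logical flag into an integer 0/1 indicator.
indicator <- flag + 0L
```

Here only district C contains 'other', so both of its rows are flagged and the EC and WC rows are not.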

Answer 2

This is the best I have been able to come up with so far using the apply family of functions:

df.1 <- read.table(text = '

    state    district    county    apples
       AA          EC        AB       100
       AA          EC        BC        10
       AA          EC        DC       150
       AA           C        FG       200
       AA           C     other        20
       AA           C        HC       250
       AA          WC        RT       300
       AA          WC        TT        30
       AA          WC     other       350

', header=TRUE, stringsAsFactors = FALSE)

z <- with(df.1, lapply(split( df.1, district), function(x) { merge(x, ifelse('other' %in% x$county, 1, 0), all=TRUE) } )) ; z
df.2 <- do.call(rbind, z)
rownames(df.2) = NULL
df.2

which gives:

  state district county apples y
1    AA        C     FG    200 1
2    AA        C  other     20 1
3    AA        C     HC    250 1
4    AA       EC     AB    100 0
5    AA       EC     BC     10 0
6    AA       EC     DC    150 0
7    AA       WC     RT    300 1
8    AA       WC     TT     30 1
9    AA       WC  other    350 1
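A lighter variant of the same split idea is possible without merge, and it also keeps the original row order. The attempts in the question fail mainly because any(x) == 'other' applies any to the character vector itself, which is an error; the comparison has to happen inside any. The sketch below is one possible way to do it, with flag as a made-up helper name and df.1 repeated so the block runs on its own.

```r
df.1 <- read.table(text = '
    state    district    county    apples
       AA          EC        AB       100
       AA          EC        BC        10
       AA          EC        DC       150
       AA           C        FG       200
       AA           C     other        20
       AA           C        HC       250
       AA          WC        RT       300
       AA          WC        TT        30
       AA          WC     other       350
', header = TRUE, stringsAsFactors = FALSE)

# One TRUE/FALSE per district, named by district. The comparison
# x == 'other' is done first, then any() reduces it to a single value.
flag <- sapply(split(df.1$county, df.1$district),
               function(x) any(x == 'other'))

# Look up each row's district by name; as.integer() gives 0/1 and
# drops the names picked up from the lookup.
df.1$indicator <- as.integer(flag[df.1$district])
```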

Answer 3

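When trying to implement the two answers above with my actual data, I realized that I had to account for a new variable, df.1$year, and that a more complex condition had to be met before the indicator variable should be one: df.1$county == 'other' & is.na(df.1$apples), within district and year. Below are the revised data set df.1 and a revised ave statement that implement these new conditions. I could not get lapply to work with these new conditions, but I did borrow some of eddi's code.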

This revised scenario seems closely enough related to my original question that I have not posted it as a new question.
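The revised data set and ave statement are not shown in this text. As a rough sketch of how such a condition could be expressed, assuming ave is simply given both grouping variables: the data frame df.x and all of its values below are invented for illustration and are not the poster's data.

```r
# Invented illustration data (NOT the poster's revised df.1).
df.x <- data.frame(
  district = c('EC', 'EC', 'C', 'C', 'C', 'C', 'WC', 'WC'),
  year     = c(1, 1, 1, 1, 2, 2, 1, 1),
  county   = c('AB', 'BC', 'FG', 'other', 'FG', 'other', 'RT', 'other'),
  apples   = c(100, 10, 200, NA, 200, 20, 300, 350),
  stringsAsFactors = FALSE
)

# A row's group is flagged only if the group (district within year) has
# a county == 'other' whose apples value is missing. Passing two grouping
# vectors to ave() groups by their combination.
df.x$indicator <- with(df.x,
  ave(county == 'other' & is.na(apples), district, year, FUN = any) + 0L)
```

In this made-up data only district C in year 1 has an 'other' county with missing apples, so only its two rows get indicator 1.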


Identify groups containing an indicator
