In R, how do I perform an operation on a specific subset of a data.frame?

Time: 2017-04-20 23:54:24

Tags: r

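(I have a feeling I'll feel silly once I get the answer, but I can't figure this out.)

I have a data.frame with an empty column at the end. It will be mostly filled with NA, but I want to fill some of its rows with a value. This column represents a guess at data that is missing from one of the other columns in the data.frame.

My initial data.frame looks like this: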

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        |median(df$MaxPlayers[df$MinPlayers ==3,])
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        |median(df$MaxPlayers[df$MinPlayers ==2,])
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |

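Note that two of the rows have "N/A" for MaxPlayers. What I'm trying to do is use the information I have to guess what MaxPlayers might be. If the median(MaxPlayers) for 3-player games is 6, then MaxPlayersGuess should equal 6 for games with MinPlayers == 3 and MaxPlayers == N/A. (I tried to indicate in code what value MaxPlayersGuess should get in the example above.)

The resulting data.frame would look like this: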

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        |6
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        |4
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |

Here is the result of one attempt:

gld$MaxPlayersGuess <- ifelse(is.na(gld$MaxPlayers), median(gld$MaxPlayers[gld$MinPlayers,]), NA)


Error in gld$MaxPlayers[gld$MinPlayers, ] : 
incorrect number of dimensions

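3 answers:

Answer 0 (score: 2)

Updated with respect to the posted example.

Here's my tip for today: sometimes it's easier to compute what you want up front and then grab it when you need it, rather than working through all that conditional logic at once. You were trying to come up with a way to compute everything in one go, and that got confusing; break it into steps. You need to know the median of "MaxPlayer" for each possible "MinPlayer" group. Then you want to use that value wherever MaxPlayer is missing. So here is a simple way to do it.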

#generate fake data 
MinPlayer <- rep(3:2, each = 4)
MaxPlayer <- rep(2:5, each = 2, times = 2)

df <- data.frame(MinPlayer, MaxPlayer)

#replace some values of MaxPlayer with NA
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer)

####STARTING DATA
# > df
# MinPlayer MaxPlayer
# 1          3         2
# 2          3         2
# 3          3        NA
# 4          3        NA
# 5          2         4
# 6          2         4
# 7          2         5
# 8          2         5
# 9          3         2
# 10         3         2
# 11         3        NA
# 12         3        NA
# 13         2         4
# 14         2         4
# 15         2         5
# 16         2         5

####STEP 1
#find the median of MaxPlayer for each group of MinPlayer (e.g., when MinPlayer == 1, 2 or whatever)
#just add a column to the data frame that has the right median value for each subset of MinPlayer in it and grab that value to use later. 
library(plyr) #plyr is a great way to compute things across data subsets
df <- ddply(df, c("MinPlayer"), transform, 
            median.minp = median(MaxPlayer, na.rm = TRUE)) #ignore NAs in the median

####STEP 2
#anytime that MaxPlayer == NA, grab the median value to replace the NA, otherwise keep the MaxPlayer value
df$MaxPlayer <- ifelse(is.na(df$MaxPlayer), df$median.minp, df$MaxPlayer)

####STEP 3
#you had to compute an extra column you don't really want, so drop it now that you're done with it
df <- df[ , !(names(df) %in% "median.minp")]

####RESULT
# > df
# MinPlayer MaxPlayer
# 1          2         4
# 2          2         4
# 3          2         5
# 4          2         5
# 5          2         4
# 6          2         4
# 7          2         5
# 8          2         5
# 9          3         2
# 10         3         2
# 11         3         2
# 12         3         2
# 13         3         2
# 14         3         2
# 15         3         2
# 16         3         2
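The same two steps can also be sketched in base R, without plyr. This is only one possible variant, assuming the fake data from above; the name `medians` is introduced here for illustration. `tapply()` builds a small lookup table of per-group medians, and the missing entries are then filled from it. Unlike `ddply()`, this keeps the rows in their original order.

```r
# Rebuild the fake data used in this answer
df <- data.frame(MinPlayer = rep(3:2, each = 4, times = 2),
                 MaxPlayer = rep(2:5, each = 2, times = 2))
df$MaxPlayer[df$MaxPlayer == 3] <- NA

# STEP 1: per-group medians as a named lookup table (names are "2" and "3")
medians <- tapply(df$MaxPlayer, df$MinPlayer, median, na.rm = TRUE)

# STEP 2: fill only the missing rows, looking up each row's group by name
missing <- is.na(df$MaxPlayer)
df$MaxPlayer[missing] <- medians[as.character(df$MinPlayer[missing])]
```

No helper column is created, so there is nothing to drop afterwards.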

Old answer below...

Please post a reproducible example!!

#fake data 
this <- rep(1:2, each = 1, times = 2)
that <- rep(3:2, each = 1, times = 2)

df <- data.frame(this, that)

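If you are just asking about basic indexing, e.g. finding the values that meet a condition, this returns the row indices of the values that match your condition (look up ?which):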

> which(df$this < df$that)
[1] 1 3

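This returns the values that match your condition rather than the row indices: you just use the row indices returned by "which" to pull the corresponding values from the right column of the data frame (here, "this").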

> df[which(df$this < df$that), "this"]
[1] 1 1
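One practical difference between indexing with `which()` and indexing with the bare logical test shows up when the data contain NA, as MaxPlayers does in the question. A minimal sketch, using a made-up vector `x` that is not from the original post:

```r
x <- c(1, NA, 3, 0)

# A logical index keeps a placeholder NA wherever the test itself is NA
by_logical <- x[x < 2]

# which() returns only the positions where the test is TRUE, so NAs drop out
by_which <- x[which(x < 2)]
```

Here `by_logical` has three elements, one of them NA, while `by_which` has only the two values that actually pass the test.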

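If you want to do something to "this" whenever it is less than "that", and add the result to the data frame as a new column, just use "ifelse". ifelse builds a logical vector marking where your condition is met, and then does something to the elements that meet it (i.e. where your logical test == TRUE).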

#if "this" is < "that", multiply by 2 
df$result <- ifelse(df$this < df$that, df$this * 2, NA)

> df
this that result
1    1    3      2
2    2    2     NA
3    1    3      2
4    2    2     NA

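Without a reproducible example, I can't offer anything more specific.

Answer 1 (score: 1)

I think you already have everything you need in @griffmer's answer. But a less elegant, though perhaps more intuitive, way might be a loop: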

## Your data:
df <- data.frame(
        Game = LETTERS[1:11],
        Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
        MinPlayers = c(rep(3,5), rep(2,6)),
        MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)     
)

## Loop over rows:
df$MaxPlayersGuess <- vapply(1:nrow(df), function(ii){
            if (is.na(df$MaxPlayers[ii])){
                median(df$MaxPlayers[df$MinPlayers == df$MinPlayers[ii]],
                        na.rm = TRUE)               
            } else {
                df$MaxPlayers[ii]
            }           
        }, numeric(1))

which gives you

df
#    Game Rating MinPlayers MaxPlayers MaxPlayersGuess
# 1     A    6.0          3          6               6
# 2     B    7.0          3          7               7
# 3     C    6.5          3         NA               6
# 4     D    7.0          3          6               6
# 5     E    7.0          3          5               5
# 6     F    9.5          2          5               5
# 7     G    6.0          2          4               4
# 8     H    7.0          2          4               4
# 9     I    6.5          2         NA               4
# 10    J    7.0          2          2               2
# 11    K    7.0          2          4               4
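If the guess column should stay NA wherever MaxPlayers is already known, as in the desired output in the question, a vectorized variant of the question's own one-liner is possible. The original attempt fails because `gld$MaxPlayers` is a plain vector, so a two-dimensional index of the form `[i, ]` has too many dimensions. The sketch below, which assumes the same data as above, swaps in base R's `ave()` to get the per-group median aligned with the rows:

```r
df <- data.frame(
        Game = LETTERS[1:11],
        Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
        MinPlayers = c(rep(3,5), rep(2,6)),
        MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)
)

# Median of MaxPlayers within each MinPlayers group, repeated for every row
group_median <- ave(df$MaxPlayers, df$MinPlayers,
                    FUN = function(x) median(x, na.rm = TRUE))

# Use the group median only where MaxPlayers is missing; NA everywhere else
df$MaxPlayersGuess <- ifelse(is.na(df$MaxPlayers), group_median, NA)
```

With this data only rows C and I receive a guess (6 and 4 respectively).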

Answer 2 (score: 1)

If you want to use dplyr, you can try:

Input:

df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
  MinPlayers = c(rep(3,5), rep(2,6)),
  MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)
)

Process:

library(dplyr)

df %>% 
  group_by(MinPlayers) %>%
  mutate(MaxPlayers = if_else(is.na(MaxPlayers), median(MaxPlayers, na.rm=TRUE), MaxPlayers))

This groups the data by MinPlayers and then assigns the median of MaxPlayers to the rows with missing data.

Output: MaxPlayers is filled with 6 for game C and 4 for game I; all other rows are unchanged.