Filtering data in a data frame

Time: 2017-06-13 23:13:33

Tags: r dataframe dplyr

I have a data frame that looks like this:

S1State S1Value S2State S2Value
NSW     20      VIC     30
WA      30      NSW     20

For each row, I want to select the state (from S1State and S2State) that has the larger value (from S1Value and S2Value). The result should look like this:

SState  SValue
VIC     30
WA      30

I am new to R and have been trying to do this with dplyr.

3 answers:

Answer 0: (score: 2)

The approach I would suggest is the following:

library(dplyr)
dt <- read.table(text = "S1State S1Value S2State S2Value
                 NSW     20      VIC     30
                 WA      30      NSW     20",
                 header = TRUE, stringsAsFactors = FALSE)
answer = dt %>% 
  mutate(SState = ifelse(S1Value > S2Value, S1State, S2State), 
         SValue = ifelse(S1Value > S2Value, S1Value, S2Value)) %>%
  select(SState, SValue)
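
One detail of the code above is that `S1Value > S2Value` sends ties to `S2State`. The following is a minimal sketch of a variant, assuming ties should go to `S1State` instead; it uses `if_else()` (the type-strict dplyr counterpart of `ifelse()`) and base R's `pmax()` for the row-wise maximum. The tie-breaking rule is an assumption, not something the question specifies.

```r
library(dplyr)

# Same sample data as in the question
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20", header = TRUE, stringsAsFactors = FALSE)

answer <- dt %>%
  # >= means a tie keeps the S1 state (assumed tie-breaking rule)
  mutate(SState = if_else(S1Value >= S2Value, S1State, S2State),
         # pmax() returns the element-wise (row-wise) maximum
         SValue = pmax(S1Value, S2Value)) %>%
  select(SState, SValue)

print(answer)
#   SState SValue
# 1    VIC     30
# 2     WA     30
```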

Answer 1: (score: 2)

Just to show that this is far from impossible with standard R tools:

nams <- c("State","Value")
tmp  <- reshape(dt, direction="long", varying=lapply(nams, grep, x=names(dt)),
                v.names=nams, timevar=NULL)
tmp[with(tmp, Value == ave(Value, id, FUN=max)),]
#    State Value id
#2.1    WA    30  2
#1.2   VIC    30  1
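
If the original row order and the wide layout should be kept, another base R option is to skip the reshape and index the columns directly. This is only a sketch: it assumes every `...Value` column has a matching `...State` column with the same prefix, and it breaks ties in favour of the first column via `max.col(..., ties.method = "first")`.

```r
# Same sample data as in the question
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20", header = TRUE, stringsAsFactors = FALSE)

# Pair up the columns by name (assumes matching State/Value prefixes)
value_cols <- grep("Value$", names(dt), value = TRUE)
state_cols <- sub("Value$", "State", value_cols)

vals   <- as.matrix(dt[value_cols])
states <- as.matrix(dt[state_cols])

# For each row, the position of the largest value (first one wins on ties)
idx  <- max.col(vals, ties.method = "first")
pick <- cbind(seq_len(nrow(dt)), idx)

# Matrix indexing with a two-column matrix picks one cell per row
res <- data.frame(SState = states[pick], SValue = vals[pick],
                  stringsAsFactors = FALSE)
```

For the sample data, `res` contains VIC 30 in the first row and WA 30 in the second.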

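Answer 2: (score: 1)

I assume the OP may have more states in the data frame, such as S3State, S4State, ...

The solutions below are based on that assumption and try to handle any number of states. If there are only two states, the approach proposed by @lebelinoz is simple and straightforward.

Solution 1

A solution using functions from dplyr and tidyr. dt2 is the final output.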

# Load packages
library(dplyr)
library(tidyr)

# Process the data
dt2 <- dt %>%
  gather(Num, Value, contains("Value")) %>%
  gather(State, Name, contains("State")) %>%
  # Only keep records with the same state number
  filter(substring(Num, 1, 2) == substring(State, 1, 2)) %>%
  mutate(Group = substring(Num, 1, 2)) %>%
  group_by(Group) %>%
  filter(Value == max(Value)) %>%
  ungroup() %>%
  select(SState = Name, SValue = Value)

Solution 2

A solution using functions from dplyr, purrr, and stringr. I loaded the tidyverse package to get the first two. Again, dt2 is the final output.

# Load packages
library(tidyverse)
library(stringr)

# Extract the column names
Col <- colnames(dt)

# Extract state numbers
ColNum <- Col %>%
  str_extract(pattern = "[0-9]") %>%
  unique()

# Design a function to process the data
dt_process <- function(pattern, dt){
  dt2 <- dt %>%
    # Extract columns based on a pattern (numbers)
    select(dplyr::contains(pattern)) %>%
    # Rename the columns
    rename_all(~sub(pattern, "", .)) %>%
    # Filter the maximum row
    filter(SValue == max(SValue))
  return(dt2)
}

# Apply the dt_process function
dt_list <- map(.x = ColNum, .f = dt_process, dt = dt)

# Bind all data frames
dt2 <- bind_rows(dt_list) %>% arrange(SState)
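
Both solutions above take the maximum within each state number (S1, S2, ...), which happens to match the expected output for the sample data. If the maximum is instead wanted within each row, one possible sketch uses `pivot_longer()` with the `.value` sentinel and `slice_max()`. It assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later, column names of the form `S<number>State` / `S<number>Value`, and that ties should keep a single row; `Row`, `Num`, and `dt_rowmax` are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20", header = TRUE, stringsAsFactors = FALSE)

dt_rowmax <- dt %>%
  # Remember which original row each record came from
  mutate(Row = row_number()) %>%
  # S1State/S1Value, S2State/S2Value, ... become one State/Value pair per record
  pivot_longer(cols = -Row,
               names_to = c("Num", ".value"),
               names_pattern = "^S(\\d+)(State|Value)$") %>%
  group_by(Row) %>%
  # Keep the record with the largest Value in each original row
  slice_max(Value, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(SState = State, SValue = Value)
```

For the sample data, `dt_rowmax` holds VIC 30 for the first row and WA 30 for the second. Setting `with_ties = TRUE` would keep every state that shares the row maximum.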

Data preparation

# Create example data frame
dt <- read.table(text = "S1State S1Value S2State S2Value
                 NSW     20      VIC     30
                 WA      30      NSW     20",
                 header = TRUE, stringsAsFactors = FALSE)