Cleaning a dataset: creating a new variable based on the date ranges of another variable

Time: 2015-09-11 23:13:42

Tags: r variables data-cleaning

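I have two data sets that need some cleaning to make them easier to analyze, and both need something very similar done to them. Let's start with data set A.

A looks like this: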

ID  Test.Date   Test.Score

1   9/22/14        25
2   1/3/2015       50
3   3/17/2015      52
4   6/1/2015       56

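So, what I need is 3 new variables that pull the score from Test.Score based on date ranges. One range is 8/01/2014 to 11/01/2014, another is 12/01/2014 to 3/25/2015, and the third is 4/01/2015 to 6/30/2015. These correspond to the fall, winter, and spring data sets. So the new data frame should look like this: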

ID  Test.Date   Test.Score   Test.Fall   Test.Winter   Test.Spring

1   9/22/14        25           25
2   1/3/2015       50                        50
3   3/17/2015      52                        52
4   6/1/2015       56                                       56

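If no score was collected in a given date range, NA should preferably fill that spot in the new variable's column. Does this make sense?

Data set B is similar, but has 3 dates and 3 scores per ID. So it looks like this: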

ID  Test.Date1   Test.Date2   Test.Date3   Test.Score1   Test.Score2   Test.Score3

1   9/22/14      1/3/2015     6/1/2015     25            30            55
2   9/22/14      10/3/2015    6/1/2015     26            31            66
3   9/22/14      1/3/2015     6/1/2015     25            39            63
4   9/22/14      1/3/2015     6/1/2015     22            29            56

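B needs 3 new columns named Test.Fall, Test.Winter, and Test.Spring, with values pulled according to the same date ranges as above. Now, you may be wondering: why not just rename the test score columns? Because some participants took two tests within a few weeks of each other (see sample ID #2). We need one test each for fall, winter, and spring. So, if Test.Date1 is in September and Test.Date2 is in October, that participant ID would not get a Test.Winter score at all.

Is there anything I need to clarify?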

2 Answers:

Answer 0: (score: 0)

library(dplyr)
library(lubridate)
library(tidyr)
library(stringi)

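# Sample data set A; mdy() parses month/day/year strings.
# tibble() replaces the deprecated data_frame().
A = 
  tibble(
    ID = 1:4,
    Test.Date = c("9/22/2014", "1/3/2015", "3/17/2015", "6/1/2015") %>% mdy,
    Test.Score = c(25, 50, 52, 56))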

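# Label each date with the name of its season column ("Test.Fall", ...).
# between() includes both endpoints; dates outside every range become "Test.Other".
classify = function(date_vector)
  ifelse(
    date_vector %>% between(mdy("8/01/2014"), mdy("11/01/2014")),
    "Fall",
    ifelse(
      date_vector %>% between(mdy("12/01/2014"), mdy("3/25/2015")),
      "Winter",
      ifelse(
        date_vector %>% between(mdy("4/01/2015"), mdy("6/30/2015")),
        "Spring",
        "Other"))) %>%
  paste("Test", ., sep = ".")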


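# Spread the scores into one column per season (NA elsewhere),
# then join Test.Date and Test.Score back on by ID.
result.A = 
  A %>%
  mutate(Season = classify(Test.Date)) %>%
  spread(Season, Test.Score) %>%
  select(-Test.Date) %>%
  left_join(A, by = "ID")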

# Sample data set B: three date/score pairs per ID
B = 
  tibble(
    ID = 1:4,
    Test.Date1 = 
      c("9/22/14", "9/22/14", "9/22/14", "9/22/14") %>% mdy,
    Test.Date2 = 
      c("1/3/2015", "10/3/2015", "1/3/2015", "1/3/2015") %>% mdy,
    Test.Date3 = 
      c("6/1/2015", "6/1/2015", "6/1/2015", "6/1/2015") %>% mdy,
    Test.Score1 = c(25, 26, 25, 22),
    Test.Score2 = c(30, 31, 39, 29),
    Test.Score3 = c(55, 66, 63, 56))

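result.B =
  B %>% 
  # Stack all date and score columns, then split each name into type ("Date"/"Score") and test number
  gather(variable, value, -ID) %>%
  mutate(type = 
           variable %>% stri_sub(6, -2),
         rep = 
           variable %>% stri_sub(-1)) %>%
  select(-variable) %>%
  spread(type, value) %>%
  # gather() stored the dates as plain numbers. This assumes mdy() returned Date objects
  # (as in current lubridate), so the numbers are days since 1970-01-01.
  mutate(Date = as.Date(Date, origin = "1970-01-01"),
         Season = classify(Date)) %>%
  group_by(ID, Season) %>%
  # Average the scores when an ID has more than one test in the same season
  summarize(Score = mean(Score)) %>%
  spread(Season, Score)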
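
In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a sketch of how the result.B pipeline might look with those functions, not a drop-in replacement from the original answer. The name result.B2 is made up for illustration, the sample dates are written in ISO format to keep the example short, and the mean() tie-break is simply carried over from the code above.

```r
library(dplyr)
library(tidyr)

# Sample data set B from the question, with dates written in ISO format
B <- tibble(
  ID = 1:4,
  Test.Date1 = as.Date(rep("2014-09-22", 4)),
  Test.Date2 = as.Date(c("2015-01-03", "2015-10-03", "2015-01-03", "2015-01-03")),
  Test.Date3 = as.Date(rep("2015-06-01", 4)),
  Test.Score1 = c(25, 26, 25, 22),
  Test.Score2 = c(30, 31, 39, 29),
  Test.Score3 = c(55, 66, 63, 56)
)

# result.B2 is a hypothetical name used only in this sketch
result.B2 <- B %>%
  # One row per ID and test; ".value" keeps Date and Score as separate, typed columns
  pivot_longer(-ID,
               names_to = c(".value", "rep"),
               names_pattern = "Test\\.(Date|Score)(\\d)") %>%
  mutate(Season = case_when(
    between(Date, as.Date("2014-08-01"), as.Date("2014-11-01")) ~ "Test.Fall",
    between(Date, as.Date("2014-12-01"), as.Date("2015-03-25")) ~ "Test.Winter",
    between(Date, as.Date("2015-04-01"), as.Date("2015-06-30")) ~ "Test.Spring",
    TRUE ~ "Test.Other"
  )) %>%
  group_by(ID, Season) %>%
  # Same tie-break as the code above: average the scores within a season
  summarize(Score = mean(Score), .groups = "drop") %>%
  pivot_wider(names_from = Season, values_from = Score)

print(result.B2)
```

Because the dates never pass through a shared value column, they stay Date objects and no origin conversion is needed. With the sample data, ID 2 ends up with NA in Test.Winter, since its second test date (10/3/2015) lies outside all three ranges.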

Answer 1: (score: 0)

Here is a base R solution for the data set A part of the question:

# Construct Test Data
mydata <- data.frame(ID = c(1:4),
                     Test.Date = c("9/22/14", "1/3/2015", "3/17/2015", "6/1/2015"),
                     Test.Score = c(25, 50, 52, 56))

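# Format dates: the sample mixes two-digit and four-digit years,
# so parse everything with %Y first, then re-parse the two-digit entries with %y
mydata$Test.Date <- as.character(mydata$Test.Date)
short.year <- nchar(sub(".*/", "", mydata$Test.Date)) == 2
mydata$newDate <- as.Date(mydata$Test.Date, "%m/%d/%Y")
mydata$newDate[short.year] <- as.Date(mydata$Test.Date[short.year], "%m/%d/%y")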

# Classify tests (range endpoints included, matching the ranges given in the question)
mydata$Test.Fall <- ifelse(mydata$newDate >= as.Date("2014-08-01") & mydata$newDate <= as.Date("2014-11-01"), mydata$Test.Score, NA)
mydata$Test.Winter <- ifelse(mydata$newDate >= as.Date("2014-12-01") & mydata$newDate <= as.Date("2015-03-25"), mydata$Test.Score, NA)
mydata$Test.Spring <- ifelse(mydata$newDate >= as.Date("2015-04-01") & mydata$newDate <= as.Date("2015-06-30"), mydata$Test.Score, NA)
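
The three ifelse() lines differ only in the season name and the two boundary dates, so they could also be driven from a small lookup table. The following is a minimal sketch under two assumptions: the dates are already parsed into newDate, and both ends of each range count as inside it. The windows data frame and the in.window variable are names invented for this example.

```r
# Sample data set A with dates already parsed
mydata <- data.frame(
  ID = 1:4,
  newDate = as.Date(c("2014-09-22", "2015-01-03", "2015-03-17", "2015-06-01")),
  Test.Score = c(25, 50, 52, 56)
)

# Season windows from the question; inclusive endpoints are an assumption
windows <- data.frame(
  season = c("Fall", "Winter", "Spring"),
  start  = as.Date(c("2014-08-01", "2014-12-01", "2015-04-01")),
  end    = as.Date(c("2014-11-01", "2015-03-25", "2015-06-30")),
  stringsAsFactors = FALSE
)

# Add one Test.<season> column per window: the score if the date is inside, NA otherwise
for (i in seq_len(nrow(windows))) {
  in.window <- mydata$newDate >= windows$start[i] & mydata$newDate <= windows$end[i]
  mydata[[paste0("Test.", windows$season[i])]] <- ifelse(in.window, mydata$Test.Score, NA)
}

print(mydata)
```

Adding another collection period then only requires one more row in windows.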

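Your question about data set B is not entirely clear:

"We need one test each for fall, winter, and spring. So, if Test.Date1 is in September and Test.Date2 is in October, that participant ID would not get a Test.Winter score at all."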

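If two test dates fall within the fall date range, which test score should go in Test.Fall? The higher score? The most recent score? The average? I'm happy to update my answer once you provide that information.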
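
Until that is settled, here is a minimal base R sketch for data set B that assumes the most recent score inside each range should win. That tie-break is a guess and not something the question states. The helper latest_in_window and the long data frame are names invented for this example, and the sample dates are written in ISO format.

```r
# Sample data set B from the question, with dates written in ISO format
B <- data.frame(
  ID = 1:4,
  Test.Date1 = as.Date(rep("2014-09-22", 4)),
  Test.Date2 = as.Date(c("2015-01-03", "2015-10-03", "2015-01-03", "2015-01-03")),
  Test.Date3 = as.Date(rep("2015-06-01", 4)),
  Test.Score1 = c(25, 26, 25, 22),
  Test.Score2 = c(30, 31, 39, 29),
  Test.Score3 = c(55, 66, 63, 56)
)

# Stack the three date/score pairs into one long table (one row per ID and test)
long <- data.frame(
  ID    = rep(B$ID, times = 3),
  Date  = c(B$Test.Date1, B$Test.Date2, B$Test.Date3),
  Score = c(B$Test.Score1, B$Test.Score2, B$Test.Score3)
)

# Hypothetical helper: for each ID, the latest score dated within [start, end], or NA if none.
# "Latest wins" is an assumed rule; swap in max() or mean() for a different one.
latest_in_window <- function(long, ids, start, end) {
  sapply(ids, function(id) {
    rows <- long[long$ID == id & long$Date >= start & long$Date <= end, ]
    if (nrow(rows) == 0) NA_real_ else rows$Score[which.max(as.numeric(rows$Date))]
  })
}

B$Test.Fall   <- latest_in_window(long, B$ID, as.Date("2014-08-01"), as.Date("2014-11-01"))
B$Test.Winter <- latest_in_window(long, B$ID, as.Date("2014-12-01"), as.Date("2015-03-25"))
B$Test.Spring <- latest_in_window(long, B$ID, as.Date("2015-04-01"), as.Date("2015-06-30"))

print(B)
```

With the sample data, ID 2 gets NA for Test.Winter because its second test date (10/3/2015) falls outside all three ranges.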