I have two datasets that need some cleanup before analysis, and both need very similar treatment. Let's start with dataset A.
A looks like this:
ID Test.Date Test.Score
1 9/22/14 25
2 1/3/2015 50
3 3/17/2015 52
4 6/1/2015 56
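What I need is 3 new variables that pull scores from Test.Score based on date ranges. One range is 8/01/2014-11/01/2014, another is 12/01/2014-3/25/2015, and the third is 4/01/2015-6/30/2015. These correspond to the fall, winter, and spring datasets. So the new data frame should look like this: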
ID Test.Date Test.Score Test.Fall Test.Winter Test.Spring
1 9/22/14 25 25 NA NA
2 1/3/2015 50 NA 50 NA
3 3/17/2015 52 NA 52 NA
4 6/1/2015 56 NA NA 56
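If no score was collected within a given date range, the corresponding cell in the new column should preferably be filled with NA. Does that make sense?
Dataset B is similar, but has 3 dates and 3 scores per ID. So it looks like this: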
ID Test.Date1 Test.Date2 Test.Date3 Test.Score1 Test.Score2 Test.Score3
1 9/22/14 1/3/2015 6/1/2015 25 30 55
2 9/22/14 10/3/2015 6/1/2015 26 31 66
3 9/22/14 1/3/2015 6/1/2015 25 39 63
4 9/22/14 1/3/2015 6/1/2015 22 29 56
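B needs 3 new columns named Test.Fall, Test.Winter, and Test.Spring, filled using the same date ranges as above. Now you may wonder: why not just rename the test score columns? Because some participants took two tests within a few weeks of each other (see sample ID #2). We need one test each for fall, winter, and spring. So if Test.Date1 is in September and Test.Date2 is in October, that participant ID gets no Test.Winter score at all.
Is there anything I need to clarify?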
Answer 0 (score: 0)
library(dplyr)
library(lubridate)
library(tidyr)
library(stringi)
# Dataset A; tibble() replaces the deprecated data_frame()
A =
  tibble(
    ID = 1:4,
    Test.Date = c("9/22/2014", "1/3/2015", "3/17/2015", "6/1/2015") %>% mdy,
    Test.Score = c(25, 50, 52, 56))
# Label each date with its season column name; dates outside all
# three ranges are labelled "Test.Other"
classify = function(date_vector)
  ifelse(
    date_vector %>% between(mdy("8/01/2014"), mdy("11/01/2014")),
    "Fall",
    ifelse(
      date_vector %>% between(mdy("12/01/2014"), mdy("3/25/2015")),
      "Winter",
      ifelse(
        date_vector %>% between(mdy("4/01/2015"), mdy("6/30/2015")),
        "Spring",
        "Other"))) %>%
    paste("Test", ., sep = ".")
# One column per season, then join back to recover Test.Date and Test.Score
result.A =
  A %>%
  mutate(Season = classify(Test.Date)) %>%
  spread(Season, Test.Score) %>%
  select(-Test.Date) %>%
  left_join(A, by = "ID")
# Dataset B, with dates exactly as given in the question
B =
  tibble(
    ID = 1:4,
    Test.Date1 =
      c("9/22/14", "9/22/14", "9/22/14", "9/22/14") %>% mdy,
    Test.Date2 =
      c("1/3/2015", "10/3/2015", "1/3/2015", "1/3/2015") %>% mdy,
    Test.Date3 =
      c("6/1/2015", "6/1/2015", "6/1/2015", "6/1/2015") %>% mdy,
    Test.Score1 = c(25, 26, 25, 22),
    Test.Score2 = c(30, 31, 39, 29),
    Test.Score3 = c(55, 66, 63, 56))
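result.B =
  B %>%
  # gather() stacks dates and scores in one column, so the Date class
  # is dropped and dates become days since 1970-01-01
  gather(variable, value, -ID) %>%
  mutate(type =
           variable %>% stri_sub(6, -2),
         rep =
           variable %>% stri_sub(-1)) %>%
  select(-variable) %>%
  spread(type, value) %>%
  # Restore the Date class (assumes mdy() returned Date, as recent
  # lubridate versions do; as.POSIXct() would read the days as seconds)
  mutate(Date = as.Date(Date, origin = "1970-01-01"),
         Season = classify(Date)) %>%
  group_by(ID, Season) %>%
  # Average the scores when two tests fall in the same season
  summarize(Score = mean(Score)) %>%
  spread(Season, Score)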
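As an alternative to the gather()/spread() reshaping above, here is a minimal sketch using pivot_longer() and pivot_wider(), assuming tidyr 1.0 or later and dplyr 1.0 or later are installed. The `.value` sentinel keeps the dates and scores in separate columns, so the Date class is never dropped. The names `long.B` and `wide.B` are made up for illustration, and averaging two scores in the same season is an assumption carried over from the code above, not something the question specifies.

```r
library(dplyr)
library(tidyr)
library(lubridate)

B <- tibble(
  ID = 1:4,
  Test.Date1 = mdy(c("9/22/14", "9/22/14", "9/22/14", "9/22/14")),
  Test.Date2 = mdy(c("1/3/2015", "10/3/2015", "1/3/2015", "1/3/2015")),
  Test.Date3 = mdy(c("6/1/2015", "6/1/2015", "6/1/2015", "6/1/2015")),
  Test.Score1 = c(25, 26, 25, 22),
  Test.Score2 = c(30, 31, 39, 29),
  Test.Score3 = c(55, 66, 63, 56)
)

# One row per (ID, test): columns ID, rep, Date, Score.
# ".value" sends the "Date"/"Score" part of each name to its own column.
long.B <- B %>%
  pivot_longer(
    cols = -ID,
    names_to = c(".value", "rep"),
    names_pattern = "Test\\.(Date|Score)([0-9])"
  )

wide.B <- long.B %>%
  mutate(Season = case_when(
    between(Date, mdy("8/01/2014"), mdy("11/01/2014")) ~ "Test.Fall",
    between(Date, mdy("12/01/2014"), mdy("3/25/2015")) ~ "Test.Winter",
    between(Date, mdy("4/01/2015"), mdy("6/30/2015")) ~ "Test.Spring",
    TRUE ~ "Test.Other"  # dates outside every range
  )) %>%
  group_by(ID, Season) %>%
  # Assumption: average the scores if two tests share a season
  summarize(Score = mean(Score), .groups = "drop") %>%
  pivot_wider(names_from = Season, values_from = Score)
```

With the sample data as written, ID 2's second date (10/3/2015) falls outside all three ranges, so that score lands in a Test.Other column and its Test.Winter is NA.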
Answer 1 (score: 0)
Here is a solution for dataset A using base R:
# Construct test data
mydata <- data.frame(ID = 1:4,
                     Test.Date = c("9/22/14", "1/3/2015", "3/17/2015", "6/1/2015"),
                     Test.Score = c(25, 50, 52, 56),
                     stringsAsFactors = FALSE)
# Format dates: the year appears with either two or four digits,
# so pick the format per row instead of hard-coding row numbers
four_digit <- grepl("/[0-9]{4}$", mydata$Test.Date)
mydata$newDate <- as.Date(mydata$Test.Date, "%m/%d/%y")
mydata$newDate[four_digit] <- as.Date(mydata$Test.Date[four_digit], "%m/%d/%Y")
# Classify tests (range endpoints are treated as inclusive)
mydata$Test.Fall <- ifelse(mydata$newDate >= as.Date("2014-08-01") & mydata$newDate <= as.Date("2014-11-01"), mydata$Test.Score, NA)
mydata$Test.Winter <- ifelse(mydata$newDate >= as.Date("2014-12-01") & mydata$newDate <= as.Date("2015-03-25"), mydata$Test.Score, NA)
mydata$Test.Spring <- ifelse(mydata$newDate >= as.Date("2015-04-01") & mydata$newDate <= as.Date("2015-06-30"), mydata$Test.Score, NA)
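The three classification lines differ only in their date bounds, so they could be folded into one helper. The following is a rough sketch rather than part of the original answer: `season_score` and `seasons` are hypothetical names, and treating the range endpoints as inclusive is an assumption.

```r
# Hypothetical helper: return the score when the date lies inside
# [from, to] (inclusive, an assumption), otherwise NA.
season_score <- function(dates, scores, from, to) {
  ifelse(dates >= as.Date(from) & dates <= as.Date(to), scores, NA)
}

mydata <- data.frame(
  ID = 1:4,
  newDate = as.Date(c("2014-09-22", "2015-01-03", "2015-03-17", "2015-06-01")),
  Test.Score = c(25, 50, 52, 56)
)

# Season boundaries taken from the question
seasons <- list(
  Test.Fall   = c("2014-08-01", "2014-11-01"),
  Test.Winter = c("2014-12-01", "2015-03-25"),
  Test.Spring = c("2015-04-01", "2015-06-30")
)

# Add one column per season
for (name in names(seasons)) {
  bounds <- seasons[[name]]
  mydata[[name]] <- season_score(mydata$newDate, mydata$Test.Score,
                                 bounds[1], bounds[2])
}
```

Keeping the boundaries in one list means a change to a season's range only has to be made in one place.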
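Your question about dataset B is not entirely clear:
"We need one test each for fall, winter, and spring. So if Test.Date1 is in September and Test.Date2 is in October, that participant ID gets no Test.Winter score at all."
If two test dates fall within the fall date range, which test score should go in Test.Fall? The higher score? The most recent score? The average? I'm happy to update my answer once you provide that information.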
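To make the choice concrete, here is a small base R sketch of the candidate rules. It is illustrative only: the data frame `fall` is made up, and it assumes ID 2's second test was on 10/3/2014 (the question's table says 10/3/2015, but the text describes two tests a few weeks apart).

```r
# Hypothetical fall-window tests, one row per test.
# Assumption: ID 2 was tested twice, on 2014-09-22 and 2014-10-03.
fall <- data.frame(
  ID = c(1, 2, 2, 3, 4),
  Date = as.Date(c("2014-09-22", "2014-09-22", "2014-10-03",
                   "2014-09-22", "2014-09-22")),
  Score = c(25, 26, 31, 25, 22)
)

# Rule 1: highest score per ID
highest <- aggregate(Score ~ ID, data = fall, FUN = max)

# Rule 2: average score per ID
average <- aggregate(Score ~ ID, data = fall, FUN = mean)

# Rules 3 and 4: sort by date within ID, then keep the last
# (most recent) or first (earliest) row of each ID
by_date <- fall[order(fall$ID, fall$Date), ]
latest <- by_date[!duplicated(by_date$ID, fromLast = TRUE), c("ID", "Score")]
earliest <- by_date[!duplicated(by_date$ID), c("ID", "Score")]
```

For ID 2 these rules give 31 (highest), 28.5 (average), 31 (most recent), and 26 (earliest), so the choice changes the result whenever a participant was tested twice in one season.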