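Does anybody know the best R alternative to the SAS first. and last. operators? I could not find one.
SAS has the FIRST. and LAST. automatic variables, which identify the first and last records in a group of rows sharing the same value of a particular variable; so in the following dataset FIRST.model and LAST.model are defined as: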
Model,SaleID,First.Model,Last.Model
Explorer,1,1,0
Explorer,2,0,0
Explorer,3,0,0
Explorer,4,0,1
Civic,5,1,0
Civic,6,0,0
Civic,7,0,1
Answer 0 (score: 9)
It sounds like you are looking for !duplicated, with the fromLast argument set to FALSE (for the first record) or TRUE (for the last record).
d <- datasets::Puromycin
d$state
# [1] treated treated treated treated treated treated treated
# [8] treated treated treated treated treated untreated untreated
#[15] untreated untreated untreated untreated untreated untreated untreated
#[22] untreated untreated
#Levels: treated untreated
!duplicated(d$state)
# [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[13] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
!duplicated(d$state,fromLast=TRUE)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
This function has some caveats and edge-case behavior, which you can read about in its help file (?duplicated).
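Applied to the dataset from the question, the same idea might look like the sketch below. The names sales, first_run and last_run are made up for illustration. One caveat worth spelling out: duplicated looks at the whole vector, so it only yields group-wise first/last flags when equal values are contiguous (for example, when the data are sorted by the grouping variable). The second half of the sketch flags the start and end of each consecutive run instead.

```r
# Rebuild the question's dataset (the name `sales` is an assumption)
sales <- data.frame(
  Model  = c(rep("Explorer", 4), rep("Civic", 3)),
  SaleID = 1:7
)

# 0/1 flags in the style of FIRST.Model and LAST.Model
sales$First.Model <- as.integer(!duplicated(sales$Model))
sales$Last.Model  <- as.integer(!duplicated(sales$Model, fromLast = TRUE))
sales

# If equal values may be non-contiguous, flag each consecutive run instead
x <- c("a", "a", "b", "a")
n <- length(x)
first_run <- c(TRUE, x[-1] != x[-n])   # value differs from the previous one
last_run  <- c(x[-n] != x[-1], TRUE)   # value differs from the next one
```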
Answer 1 (score: 4)
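If you are really only interested in the row indices, a straightforward use of split and range may be enough. The following assumes that the rownames in your dataset are sequentially numbered, but it could probably be adapted otherwise.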
irisFirstLast <- sapply(split(iris, iris$Species),
function(x) range(as.numeric(rownames(x))))
irisFirstLast ## Just the indices
# setosa versicolor virginica
# [1,] 1 51 101
# [2,] 50 100 150
iris[irisFirstLast[1, ], ] ## `1` would represent "first"
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
iris[irisFirstLast, ] ## no row index would return both first and last
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 50 5.0 3.3 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 100 5.7 2.8 4.1 1.3 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
# 150 5.9 3.0 5.1 1.8 virginica
d <- datasets::Puromycin
dFirstLast <- sapply(split(d, d$state),
function(x) range(as.numeric(rownames(x))))
dFirstLast
# treated untreated
# [1,] 1 13
# [2,] 12 23
d[dFirstLast[2, ], ] ## `2` would represent `last`
# conc rate state
# 12 1.1 200 treated
# 23 1.1 160 untreated
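If you are working with named rows, the general approach is the same, but you have to specify the range yourself. Here is the general pattern: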
datasetFirstLast <- sapply(split(dataset, dataset$groupingvariable),
function(x) c(rownames(x)[1],
rownames(x)[length(rownames(x))]))
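As a concrete instance of that pattern, here is a minimal sketch using mtcars, a built-in dataset whose rows are named by car model. The choice of mtcars, the grouping variable cyl, and the name carsFirstLast are assumptions made for illustration.

```r
# mtcars (from the datasets package) has car names as row names
carsFirstLast <- sapply(split(mtcars, mtcars$cyl),
                        function(x) c(rownames(x)[1], rownames(x)[nrow(x)]))

carsFirstLast                  # 2 x 3 character matrix of row names
mtcars[carsFirstLast[1, ], ]   # first row of each cyl group
mtcars[carsFirstLast[2, ], ]   # last row of each cyl group
```

Because the result holds row names rather than numbers, it can be used for indexing directly, with no as.numeric conversion.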
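If you are interested in extracting the rows rather than using the row numbers for some other purpose, you can also explore data.table. Here are some examples: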
library(data.table)
DT <- data.table(iris, key="Species")
DT[J(unique(Species)), mult = "first"]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1: setosa 5.1 3.5 1.4 0.2
# 2: versicolor 7.0 3.2 4.7 1.4
# 3: virginica 6.3 3.3 6.0 2.5
DT[J(unique(Species)), mult = "last"]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1: setosa 5.0 3.3 1.4 0.2
# 2: versicolor 5.7 2.8 4.1 1.3
# 3: virginica 5.9 3.0 5.1 1.8
DT[, .SD[c(1,.N)], by=Species]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1: setosa 5.1 3.5 1.4 0.2
# 2: setosa 5.0 3.3 1.4 0.2
# 3: versicolor 7.0 3.2 4.7 1.4
# 4: versicolor 5.7 2.8 4.1 1.3
# 5: virginica 6.3 3.3 6.0 2.5
# 6: virginica 5.9 3.0 5.1 1.8
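This last approach is quite convenient. For example, if you want the first three and the last three rows of each group, you can use: DT[, .SD[c(1:3, (.N-2):.N)], by=Species]
(For reference: .N is the number of rows in each group.)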
Other useful approaches include:
DT[, tail(.SD, 2), by = Species] ## last two rows of each group
DT[, head(.SD, 4), by = Species] ## first four rows of each group
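The examples above extract rows. If flag columns in the style of FIRST. and LAST. are wanted instead, .N can be compared against the position within each group. This is only a sketch; the names DT2, First and Last are assumptions and do not come from the answer.

```r
library(data.table)

# Assumed names: DT2, First, Last
DT2 <- data.table(iris)

# Within each group, seq_len(.N) is the row position 1..N
DT2[, `:=`(First = seq_len(.N) == 1L,
           Last  = seq_len(.N) == .N), by = Species]

DT2[First | Last]   # only the first and last row of each group
```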
Answer 2 (score: 4)
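Using the head and tail functions with the n = 1 option is a good approach; see R for SAS and SPSS Users (Robert Muenchen). Make a data frame of the by-variables of interest, then, for example for last: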
dfby<- data.frame(df$var1, df$var2)
mylastList<-by(df,dfby,tail, n=1)
#turn into a dataframe
mylastDF<-do.call(rbind,mylastList)
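The snippet above leaves df, var1 and var2 undefined. A runnable sketch of the same pattern on the dataset from the question could look like this; the names sales, lastDF and firstDF are made up for illustration.

```r
# The question's dataset (the name `sales` is an assumption)
sales <- data.frame(
  Model  = c(rep("Explorer", 4), rep("Civic", 3)),
  SaleID = 1:7
)

# Last row of each Model group, then bind the pieces into one data frame
lastDF <- do.call(rbind, by(sales, sales$Model, tail, n = 1))

# Same idea with head() for the first row of each group
firstDF <- do.call(rbind, by(sales, sales$Model, head, n = 1))

lastDF
firstDF
```

Note that by() orders the groups by the sorted values of the grouping variable (Civic before Explorer here), not by their order of appearance in the data.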
Answer 3 (score: 2)
Here is a dplyr solution:
# input
dataset <- structure(list(Model = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L
), .Label = c("Civic", "Explorer"), class = "factor"), SaleID = 1:7), .Names = c("Model",
"SaleID"), class = "data.frame", row.names = c(NA, -7L))
# code
library(dplyr)
dataset %>%
group_by(Model) %>%
mutate(
"First" = row_number() == min( row_number() ),
"Last" = row_number() == max( row_number() )
)
# output:
Model SaleID First Last
<fctr> <int> <lgl> <lgl>
1 Explorer 1 TRUE FALSE
2 Explorer 2 FALSE FALSE
3 Explorer 3 FALSE FALSE
4 Explorer 4 FALSE TRUE
5 Civic 5 TRUE FALSE
6 Civic 6 FALSE FALSE
7 Civic 7 FALSE TRUE
PS: if you do not have dplyr installed:
install.packages("dplyr")
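Answer 4 (score: 1)
The function below is based on @Joe's description of First / Last.
The function returns a list of vectors.
Each list entry corresponds to a column of the data frame (i.e. a feature or variable of the dataset).
Within a given list entry are the indices of the
first (or last) element belonging to each category of observation.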
# Pass in your data frame, and indicate whether or not you want to find Last or find First.
# Assign to the appropriate variable
first <- findFirstLast(myDF)
last <- findFirstLast(myDF, findFirst=FALSE)
data(iris)
first <- findFirstLast(iris)
last <- findFirstLast(iris, findFirst=FALSE)
first$Species
# setosa versicolor virginica
# 1 51 101
last$Species
# setosa versicolor virginica
# 50 100 150
iris[first$Species, ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
findFirstLast <- function(myDF, findFirst=TRUE) {
# myDF should be a data frame or matrix
# By default, this function finds the first occurrence of each unique value in a column
# If instead we want to find last, set findFirst to FALSE. This will give `maxOrMin` a value of -1
# finding the min of the negative indices is the same as finding the max of the positive indices.
maxOrMin <- ifelse(findFirst, 1, -1)
# For each column in myDF, make a list of all unique values (`levs`) and iterate over that list,
# finding the min (or max) of all the indices of where that given value appears within the column
apply(myDF, 2, function(colm) {
levs <- unique(colm)
sapply(levs, function(lev) {
inds <- which(colm==lev)
ifelse(length(inds)==0, NA, maxOrMin*min(inds*maxOrMin) )
})
})
}
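One thing to be aware of: apply() converts a data frame to a matrix first, so with mixed column types every column ends up being compared as character strings. A possible variant, sketched under the assumption that the input is a data frame, iterates over the columns with lapply() and reuses !duplicated. The name findFirstLastDF is hypothetical and is not part of the answer above; unlike the original, it treats NA as a value of its own.

```r
# Hypothetical variant: per column, the index of the first (or last)
# occurrence of each distinct value, named by that value
findFirstLastDF <- function(myDF, findFirst = TRUE) {
  lapply(myDF, function(colm) {
    # fromLast = TRUE keeps the last occurrence instead of the first
    keep <- !duplicated(colm, fromLast = !findFirst)
    inds <- which(keep)
    names(inds) <- as.character(colm[keep])
    inds
  })
}

first <- findFirstLastDF(iris)
last  <- findFirstLastDF(iris, findFirst = FALSE)

first$Species
last$Species
iris[first$Species, ]
```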