Using apply(?) to avoid a loop in R

Time: 2010-07-13 22:16:22

Tags: r loops

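I'm trying to apply a function to every row of a dataset. The function looks up the matching rows in a second dataset and computes a similarity score for the product details passed to it.

The function works if I call it with test numbers, but I can't figure out how to run it over all rows of my dataset. I have tried apply but could not get it to work.

I'll be iterating over different parameter settings to find the ones that best fit the historical data, so speed matters... which means loops are out. Any help you can offer would be much appreciated.

Thanks! Alan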

GetDistanceTest <- function(SnapshotDate, Cand_Type, Cand_Height, Cand_Age) {
    HeightParam <- 1/5000
    AgeParam <- 1
    Stock_SameType <- HistoricalStock[!is.na(HistoricalStock$date) &
                                      !is.na(HistoricalStock$Type) &
                                      as.character(HistoricalStock$date)==as.character(SnapshotDate) &
                                      HistoricalStock$Type==Cand_Type,]

    Stock_SameType$ED <- (HeightParam*(Stock_SameType$Height - Cand_Height))^2 + (AgeParam*(Stock_SameType$Age - Cand_Age))^2

    return(sqrt(sum(Stock_SameType$ED)))
}

HistoricalStock <- HistoricalAQStock[,c(1, 3, 4, 5)]
colnames(HistoricalStock) <- c("date", "Age", "Height", "Type")
Sales <- AllSales[,c(2,10,11,25)]
colnames(Sales) <- c("date", "Age", "Height", "Type")

GetDistanceTest("2010-04-01", 5261, 12, 7523) #works and returns a single number

res1 <- transform(Sales, ClusterScore=GetDistanceTest(date, Type, Height, Age))
# returns Error in `$<-.data.frame`(`*tmp*`, "ED", value = c(419776714.528591, 22321257.0276852,  :  replacement has 4060 rows, data has 54
# also 4 warnings, one for each variable, e.g. 1: In as.character(HistoricalStock$date) == as.character(SnapshotDate) :  longer object length is not a multiple of shorter object length

res2 <- apply(Sales, 1, GetDistanceTest, Sales$Type, Sales$Height, Sales$Age)
# returns Error in `$<-.data.frame`(`*tmp*`, "ED", value = c(419648071.041523, 22325941.2704261,  : replacement has 4060 rows, data has 13
# also the same 4 warnings as res1

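3 Answers:

Answer 0 (score: 2)

I took some liberties with your code because I tried to vectorize with the merge function instead of using a loop. You merge the two data frames and then operate on whole columns, which lets you use R's built-in vectorization. I think this will do what you want (in the second line I just make sure that A and B do not have the same values for height and age, so that your distance is not always zero):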

A <- B <- data.frame(date=Sys.Date()-9:0, stock=letters[1:10], type=1:10, height=1:10, age=1:10)
B$height <- B$age <- 10:1
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1/5000
age.param <- 1
temp <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
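
As a rough sketch of how the merge idea could reproduce the question's per-sale score, the example below uses small invented data frames with the question's column names. `SaleID` is a hypothetical key added here only to identify each sale, and the data values are made up. Note that the snippet above multiplies the squared difference by the parameter, whereas the question's function scales the difference before squaring; this sketch follows the question's form.

```r
# Toy stand-ins for the question's data (all values invented for illustration).
HistoricalStock <- data.frame(
  date   = c("2010-04-01", "2010-04-01", "2010-04-02"),
  Age    = c(10, 20, 30),
  Height = c(5000, 10000, 15000),
  Type   = c(1, 1, 2),
  stringsAsFactors = FALSE)
Sales <- data.frame(
  SaleID = 1:2,                      # hypothetical key, not in the original data
  date   = c("2010-04-01", "2010-04-02"),
  Age    = c(13, 30),
  Height = c(0, 15000),
  Type   = c(1, 2),
  stringsAsFactors = FALSE)

HeightParam <- 1/5000
AgeParam <- 1

# Pair every sale with every stock row that shares its date and Type.
Pairs <- merge(Sales, HistoricalStock, by = c("date", "Type"),
               suffixes = c(".Sale", ".Stock"))

# Squared scaled distance for each pair, computed on whole columns at once.
Pairs$ED <- (HeightParam * (Pairs$Height.Stock - Pairs$Height.Sale))^2 +
            (AgeParam * (Pairs$Age.Stock - Pairs$Age.Sale))^2

# Sum within each sale, then take the square root, as the question's function does.
Scores <- sqrt(tapply(Pairs$ED, Pairs$SaleID, sum))
```

One difference to keep in mind: `merge` performs an inner join by default, so a sale with no matching stock rows drops out of `Scores`, while the question's function would return 0 for it.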

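Answer 1 (score: 1)

Use mapply, a multivariate version of sapply, which calls the function once per row with the corresponding element of each argument: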

res1 <- mapply(GetDistanceTest, Sales$date, Sales$Type, Sales$Height, Sales$Age)
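
A minimal self-contained sketch of why this helps, using invented toy data and a condensed version of the question's function. `transform` hands whole columns to the function, so the comparisons inside it recycle vectors of different lengths; `mapply` instead passes one scalar from each column per call.

```r
# Toy stand-ins for the question's data (all values invented for illustration).
HistoricalStock <- data.frame(
  date   = c("2010-04-01", "2010-04-01", "2010-04-02"),
  Age    = c(10, 20, 30),
  Height = c(5000, 10000, 15000),
  Type   = c(1, 1, 2),
  stringsAsFactors = FALSE)
Sales <- data.frame(
  date   = c("2010-04-01", "2010-04-02"),
  Age    = c(13, 30),
  Height = c(0, 15000),
  Type   = c(1, 2),
  stringsAsFactors = FALSE)

# Condensed version of the question's function; it expects scalar arguments.
GetDistanceTest <- function(SnapshotDate, Cand_Type, Cand_Height, Cand_Age) {
  HeightParam <- 1/5000
  AgeParam <- 1
  same <- HistoricalStock[
    !is.na(HistoricalStock$date) & !is.na(HistoricalStock$Type) &
      as.character(HistoricalStock$date) == as.character(SnapshotDate) &
      HistoricalStock$Type == Cand_Type, ]
  ED <- (HeightParam * (same$Height - Cand_Height))^2 +
        (AgeParam * (same$Age - Cand_Age))^2
  sqrt(sum(ED))
}

# One call per row of Sales; the result has one score per sale.
scores <- mapply(GetDistanceTest,
                 Sales$date, Sales$Type, Sales$Height, Sales$Age)
Sales$ClusterScore <- unname(scores)
```

`mapply` still makes one function call per row, so it fixes the error but is not vectorized in the way a merge-based approach is.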

Answer 2 (score: 0)

Code from the comments above:

library(plyr)  # provides ddply

# A: 1000 candidate rows over 10 dates; B: 10000 reference rows over the same dates
A <- data.frame(date=rep(Sys.Date()-9:0,100), id=letters[1:10], type=floor(runif(1000, 1, 10)), height=runif(1000, 1, 100), age=runif(1000, 1, 100))
B <- data.frame(date=rep(Sys.Date()-9:0,1000), type=floor(runif(10000, 1, 10)), height=runif(10000, 1, 10), age=runif(10000, 1, 10))

# Pair rows that share date and type, then score every pair with vectorized arithmetic
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1
age.param <- 1
AB$ClusterScore <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )

# Sum the pair scores within each id
Scores <- ddply(AB, c("id"), function(df) sum(df$ClusterScore))
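
If plyr is not available, the final grouping step can probably be done in base R with `aggregate`. The sketch below uses a tiny made-up stand-in for the merged `AB` data frame rather than the random data above.

```r
# Tiny invented stand-in for the merged data frame AB.
AB <- data.frame(id = c("a", "a", "b"),
                 ClusterScore = c(1.5, 2.5, 4))

# Sum ClusterScore within each id; the result is a data frame
# with one row per id and columns id and ClusterScore.
Scores <- aggregate(ClusterScore ~ id, data = AB, FUN = sum)
```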