Question

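I'm a new R user and I'm not sure how to improve the following script. I have heard of the apply functions, but I haven't managed to use them. Here is my problem: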

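I have two data frames, the first called data and the second called eco. data has more than 1 million rows and eco has 90,000. Both have a common column named id. For a given id, there are several rows in data, each corresponding to the presence of a plant species.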

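I would like to record this in the eco data frame by giving each id a value that indicates whether one specific species is present or absent among that id's rows in data. This information would go in a column named sp in eco.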

The script with a for loop, which takes several hours to run:

for (k in (1:nrow(data))) {
  if (data[k, "sp"] == 1) {  # sp == 1 corresponds to one specific species
    # before this, the "sp" column is empty in eco
    eco[which(eco$id == data[k, "id"]), "sp"] = 1
  }
}

How can I improve this?

Thanks a lot for your help.

Answer 1

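With 1,000,000 records I would consider using data.table. You have the perfect setup: two tables with a common key. If you don't mind getting NA back when species 1 is not present, you can do this with a compound join, data[sp==1,][eco]. You can do it easily like this: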

# Some sample data
set.seed(123)
data <- data.frame( id = rep( letters[1:3] , each = 3 ) , sp = sample( 1:5 , 9 , TRUE ) )
eco <- data.frame( id = letters[1:3] , otherdat = rnorm(3) )
data
#   id sp
#1:  a  2
#2:  a  4
#3:  a  3
#4:  b  5
#5:  b  5
#6:  b  1 ===> species 1 is present at this id only
#7:  c  3
#8:  c  5
#9:  c  3

eco
#   id   otherdat
#1:  a -0.1089660
#2:  b -0.1172420
#3:  c  0.1830826


#  All you need to do is turn your data.frames to data.tables, with a key, like so...
require(data.table)
data <- data.table( data , key = "id" )
eco <- data.table( eco , key = "id" )

# Join relevant records from data to eco by the common key
# This way you get 1 when species 1 is present and 0 otherwise
eco[ data[ , list( sp = as.integer( any( sp == 1 ) ) ) , by = id ] ]
#   id   otherdat sp
#1:  a -0.1089660  0
#2:  b -0.1172420  1
#3:  c  0.1830826  0

# A more succinct way of doing this (and faster)
# is a compound join (but you get NA instead of 0)
data[sp==1,][eco]
#   id sp   otherdat
#1:  a NA -0.1089660
#2:  b  1 -0.1172420
#3:  c NA  0.1830826

Answer 2

Is this what you are looking for?

Edited after @Simon's comment:

eco$sp <- 0                                        # create new column `sp` initialized with 0
eco[eco$id %in% data$id[data$sp == 1], "sp"] <- 1  # set to 1 for every id that has a row with data$sp == 1
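
To check that the vectorised version gives the same result as the loop from the question, here is a small self-contained sketch in base R. The sample data frames are hypothetical stand-ins for data and eco, and the loop version starts sp at 0 rather than empty so that the two results can be compared directly.

```r
# Hypothetical sample data standing in for the real data and eco
data <- data.frame(id = rep(letters[1:3], each = 3),
                   sp = c(2, 4, 3, 5, 5, 1, 3, 5, 3),
                   stringsAsFactors = FALSE)
eco <- data.frame(id = letters[1:3], stringsAsFactors = FALSE)

# Vectorised version: one %in% lookup instead of one lookup per row of data
eco_vec <- eco
eco_vec$sp <- 0
eco_vec[eco_vec$id %in% data$id[data$sp == 1], "sp"] <- 1

# Loop version from the question, with sp initialised to 0
eco_loop <- eco
eco_loop$sp <- 0
for (k in seq_len(nrow(data))) {
  if (data[k, "sp"] == 1) {
    eco_loop[which(eco_loop$id == data[k, "id"]), "sp"] <- 1
  }
}

# Both approaches should flag exactly the same ids
stopifnot(identical(eco_vec, eco_loop))
eco_vec
```

The loop searches eco once for every row of data, whereas the vectorised form does a single pass over each table, which is the likely reason for the difference in run time on a million rows.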

Avoiding for loop with apply functions
