Question

I am new to R and would like to write a loop that lets me update the columns of a data frame using two variables.

I have two main data frames, BaseData and CountyFile (both shown here as samples; the main file has about 3 million rows).

BaseData is a list of origins and destinations, where the values (1, 2, 3, 4, etc.) are county IDs. So for UserID 2, the Origin is County 1 and the Destination is County 2, and so on.

BaseData
     UserID Origin Destination
1       1      1           1
2       2      2           1
3       3      3           2
4       4      4           4
5       5      1           2
6       6      1           3

CountyFile is a data frame that will hold the total number of interactions between each Destination county (CountyID) and every Origin county (C_1, C_2, etc.).

CountyFile    
     CountyID C_1 C_2 C_3 C_4 C_5 C_6 C_7
 1         1   0   0   0   0   0   0   0
 2         2   0   0   0   0   0   0   0
 3         3   0   0   0   0   0   0   0
 4         4   0   0   0   0   0   0   0
 5         5   0   0   0   0   0   0   0
 6         6   0   0   0   0   0   0   0

I can get the information I need by taking a subset of BaseData (where Destination == 1), counting the rows for each Origin, and then updating CountyFile$C_1.

Temp1 <- subset(BaseData, Destination  == 1) 
Temp2 <- as.data.frame(table(Temp1$Origin))
CountyFile$C_1<-Temp2[match(CountyFile$CountyID, Temp2$Var1),2]

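This updates my CountyFile data frame for the first destination (see below). I would like to use a loop to run through all the destinations (01 to 15) rather than doing this manually.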

    CountyID C_1 C_2 C_3 C_4 C_5 C_6 C_7
1         1   1   0   0   0   0   0   0
2         2   3   0   0   0   0   0   0
3         3  NA   0   0   0   0   0   0
4         4  NA   0   0   0   0   0   0
5         5  NA   0   0   0   0   0   0
6         6   1   0   0   0   0   0   0

I have made some attempts with a nested loop and two variables (i and j), shown below, but to no avail. Perhaps someone can suggest an easier solution?

for (i in c(01,02,03,04,05,06,07,08,09,10,11,12,13,14,15)) 
{
Temp1 <- subset(BaseData, Destination  == i) 
Temp2 <- as.data.frame(table(Temp1$Origin)) }
for (j in c("C_1","C_2","C_3","C_4","C_5","C_6","C_7")) 
{
CountyFile$j<-Temp2[match(CountyFile$CountyID, Temp2$Var1),2]
}

Thanks,

Justin

Answer 1

There are easier ways to manipulate data in R. Here is one approach using data.table.

I used the example data generated at the end of this answer.

library(data.table)
# the actual code starts
setDT(BaseData)
# count the number of rows in each Destination, Origin combinations
CountData <- BaseData[, .N, by = .(Destination, Origin)]
# reshape the data to wide format; combinations that never occur are filled with 0
OutputData <- dcast(CountData, Destination ~ Origin, value.var = "N", fill = 0)
# rename the columns
names(OutputData) <- c("CountyID", 
                       paste0("C_", 1:7))
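
If you would rather stay in base R, a cross-tabulation gives the same kind of counts without a loop. The following is only a minimal sketch and not part of the original answer: it assumes 15 destination counties and 7 origin counties, as in the example data, and the set.seed call and the names counts and wide are illustrative additions.

```r
# Example data in the same shape as the answer's (values are random)
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
BaseData <- data.frame(UserId = seq(500),
                       Destination = sample(15, 500, TRUE),
                       Origin = sample(7, 500, TRUE))

# Cross-tabulate Destination by Origin; explicit factor levels keep
# counties that never occur, so they show up as 0 instead of vanishing
counts <- table(factor(BaseData$Destination, levels = 1:15),
                factor(BaseData$Origin, levels = 1:7))

# Turn the 15 x 7 table into a data frame shaped like CountyFile
wide <- as.data.frame(unclass(counts))
names(wide) <- paste0("C_", 1:7)
CountyFile <- cbind(CountyID = 1:15, wide)
```

Because the factor levels are fixed up front, every county gets a row and a column even when it has no interactions, which avoids the NA values seen in the question.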

The example data was generated as follows:

# generate example data
N <- 500
BaseData <- data.frame(UserId = seq(N),
                       Destination = sample(15, N, TRUE),
                       Origin = sample(7, N, TRUE))
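
For comparison, the loop from the question can be repaired with a single index. Two things stop the original from working: the first loop closes before the second one starts, so Temp2 only ever holds the last destination, and CountyFile$j refers to a column literally named "j" rather than to the value of j. The sketch below is one possible fix, not the answer author's method. It uses the six sample rows from the question and assumes destinations 1 to 7 to match the columns C_1 to C_7; the variable origins is an illustrative name.

```r
# The sample rows shown in the question
BaseData <- data.frame(UserID = 1:6,
                       Origin = c(1, 2, 3, 4, 1, 1),
                       Destination = c(1, 1, 2, 4, 2, 3))
CountyFile <- data.frame(CountyID = 1:6)

# One loop is enough: the column name is built from the destination index
for (i in 1:7) {
  # origins of all rows whose destination is county i
  origins <- BaseData$Origin[BaseData$Destination == i]
  # count per CountyID; fixed factor levels give 0 instead of NA
  # [[ ]] uses the value of the name, unlike CountyFile$j
  CountyFile[[paste0("C_", i)]] <-
    as.vector(table(factor(origins, levels = CountyFile$CountyID)))
}
```

As in the question's own code, column C_i holds the counts for Destination == i and the rows are matched on Origin. On 3 million rows the vectorised approaches above will usually be much faster than a loop.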
