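This is my code for processing observations that are grouped into different associations. I want to compute, using Euclidean distance, the distance between each observation's description and its association.
The for loop subsets the dataset by group number. Each iteration of the for loop selects a new group of rows to process. The problem is that I want to store the calculation from every iteration. How can I do that?
I hope I have described the situation unambiguously; questions are welcome. Any major deviation from the current code, or a suggestion to look into a new approach, is welcome too!
Current situation: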
association description group
1: zzzz zzzz 1
2: zzzz efgh 1
3: zzzz hijk 1
4: aaaa lmno 2
5: aaaa pqrs 2
6: aaaa tuvw 2
7: aaaa qyza 2
8: aaaa bcde 2
9: bbbb fqhij 3
10: cccc klmn 4
Desired solution:
association description group distance
1: zzzz zzzz 1 1
2: zzzz efgh 1 0
3: zzzz hijk 1 0
4: aaaa lmno 2 0
5: aaaa pqrs 2 0
6: aaaa tuvw 2 0
7: aaaa qyza 2 0
8: aaaa bcde 2 0
9: bbbb fqhij 3 0
10: cccc klmn 4 0
Libraries
library(tm)
library(data.table)  # needed for data.table() below
library(dplyr)
Function to compute the distance
euclidean.dist <- function(x1, x2) {
  sqrt(sum((x1 - x2) ^ 2))
}
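As a quick sanity check of the helper, here is a minimal sketch on plain numeric vectors. The input vectors are made up for illustration and are not part of the data in the question.

```r
# Same helper as in the question: Euclidean distance between two numeric vectors.
euclidean.dist <- function(x1, x2) {
  sqrt(sum((x1 - x2)^2))
}

# Illustrative inputs only: a 3-4-5 right triangle and two identical vectors.
d <- euclidean.dist(c(0, 0), c(3, 4))
same <- euclidean.dist(c(1, 2, 3), c(1, 2, 3))

print(d)     # 5
print(same)  # 0
```

Identical vectors give a distance of 0, so under this measure a description that matches its association exactly ends up at 0, not 1.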
Data description
association <- c('zzzz', 'zzzz', 'zzzz', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'bbbb', 'cccc')
description <- c('zzzz', 'efgh', 'hijk', 'lmno', 'pqrs', 'tuvw', 'qyza', 'bcde', 'fqhij', 'klmn')
group <- c(1,1,1,2,2,2,2,2,3,4)
distance <- 0
mytable <- data.table(association,description,group,distance)
Index for the for loop
ID <- length(unique(mytable$group))
To explore what happens, for now set:
ID <- 1
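A side note on the loop index, as a minimal sketch: when `ID` holds a single number, `for (i in ID)` runs the body exactly once with `i` equal to that number. It does not count from 1 up to `ID`. To visit every group, `seq_len(ID)` is the usual idiom. The variable names `visited_scalar` and `visited_seq` are only for illustration.

```r
ID <- 4  # e.g. the number of unique groups

# Iterating over the scalar itself: the body runs once, with i == 4.
visited_scalar <- c()
for (i in ID) {
  visited_scalar <- c(visited_scalar, i)
}

# Iterating over 1..ID: the body runs once per group.
visited_seq <- c()
for (i in seq_len(ID)) {
  visited_seq <- c(visited_seq, i)
}

print(visited_scalar)  # 4
print(visited_seq)     # 1 2 3 4
```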
The for loop itself
for(i in ID) {
  #for each unique group, select only the rows of one group at a time
  #Get only the description column
  x1 <- filter(mytable, group == seq_along(unique(mytable$group))[[i]]) %>%
    select(description)
  #For the same rows, select the specific association of the group of rows
  x2 <- filter(mytable, group == seq_along(unique(mytable$group))[[i]] & row_number() == 1 | row_number()== n()) %>%
    select(association)
  #Rename the association column to description, so as to enable rbind
  x2 <- rename(x2, description = association)
  x3 <- rbind(x2, x1)
  #Create distance column to store the values
  x3$distance <- 0
  #Transform to a corpus to weight the terms in each doc
  mycorpus <- Corpus(DataframeSource(x3))
  dtm <- DocumentTermMatrix(mycorpus,
                            control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                           stopwords = FALSE))
  #Create a matrix for measure
  x4 <- as.matrix(dtm)
  #Get all rows, except the first row
  #The first row serves as input to calculate the euclidean for each row
  rows <- (seq(1, nrow(x3) -1) +1 )
  #Calculate for all rows the distance
  #Leave the first row empty, as it could be removed
  for(a in rows) {
    x3$distance[i] <- euclidean.dist(x4[1,], x4[a,])
  }
}
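Answer 0 (score: 0)
The following replaces the for loop with lapply. I personally prefer the *apply family of functions in R, because it is clear what they will return, which is not always the case with a for loop.
We do essentially the same thing, iterating a sequence of functions over each element of ID. Note that the for loop inside this sequence has also been changed to lapply.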
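To make the difference concrete, here is a minimal sketch on toy data (not the data from the question). A for loop only keeps results if you store them yourself, for example in a preallocated list, whereas lapply returns one list element per iteration automatically.

```r
# for loop: results must be stored explicitly, here in a preallocated list.
squares_loop <- vector("list", 3)
for (i in seq_len(3)) {
  squares_loop[[i]] <- data.frame(group = i, value = i^2)
}

# lapply: the value of each function call becomes one element of the result.
squares_apply <- lapply(seq_len(3), function(i) {
  data.frame(group = i, value = i^2)
})

# Either list can then be stacked into a single data frame.
combined <- do.call(rbind, squares_apply)
print(combined)
```

Both approaches produce the same list here; the lapply version simply leaves no room to forget the assignment or to index the wrong element.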
lapply(1:ID, function(i) {
  x1 <- filter(mytable, group == seq_along(unique(mytable$group))[[i]]) %>%
    select(description)
  x2 <- filter(mytable, group == seq_along(unique(mytable$group))[[i]] & row_number() == 1 | row_number()== n()) %>%
    select(association)
  x2 <- rename(x2, description = association)
  x3 <- rbind(x2, x1)
  x3$distance <- 0
  mycorpus <- Corpus(DataframeSource(x3))
  dtm <- DocumentTermMatrix(mycorpus,
                            control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                           stopwords = FALSE))
  x4 <- as.matrix(dtm)
  rows <- ( seq(1, nrow(x3) -1) + 1 )
  lapply(rows, function(a) {
    x3$distance[a] <<- euclidean.dist(x4[1, ], x4[a, ])
  })
  x3 %>% mutate(group = i)
}) %>%
  do.call(what = bind_rows)
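For comparison, the same split, compute, and combine pattern can be sketched in base R alone. This is only an illustration under stated assumptions: `letter_counts` is a hypothetical stand-in that turns a string into a vector of letter counts, and it replaces the tm tf-idf weighting used above, so the distances will not match those from the DocumentTermMatrix approach.

```r
euclidean.dist <- function(x1, x2) sqrt(sum((x1 - x2)^2))

# Hypothetical vectorizer (assumption, not tf-idf): counts of a..z in a string.
letter_counts <- function(s) {
  chars <- strsplit(s, "")[[1]]
  as.numeric(table(factor(chars, levels = letters)))
}

mytable <- data.frame(
  association = c("zzzz", "zzzz", "zzzz", "aaaa", "aaaa",
                  "aaaa", "aaaa", "aaaa", "bbbb", "cccc"),
  description = c("zzzz", "efgh", "hijk", "lmno", "pqrs",
                  "tuvw", "qyza", "bcde", "fqhij", "klmn"),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4),
  stringsAsFactors = FALSE
)

# One data frame per group; each gets a distance column, then all are stacked.
per_group <- lapply(split(mytable, mytable$group), function(g) {
  ref <- letter_counts(g$association[1])  # the group's association vector
  g$distance <- vapply(
    g$description,
    function(d) euclidean.dist(ref, letter_counts(d)),
    numeric(1),
    USE.NAMES = FALSE
  )
  g
})
result <- do.call(rbind, per_group)
rownames(result) <- NULL
print(result)
```

Because every group's data frame is returned from the function and combined at the end, the result of each iteration is kept without any assignment to an outer variable.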