Processing unique subsets of data

Time: 2017-03-07 12:51:02

Tags: r loops

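This is my code for processing observations that are grouped under different associations. For each observation, I want to compute the Euclidean distance between its description and its association.

The for loop subsets the data set by group number. Each iteration of the loop selects a new group of rows to process. The problem is that I want to store the result computed in each iteration. How can I do that?

I hope I have described the situation unambiguously; questions are welcome. Suggestions that depart substantially from the current code, or pointers to a different approach, are welcome too!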

Current situation:

       association description group
 1:        zzzz        zzzz     1        
 2:        zzzz        efgh     1        
 3:        zzzz        hijk     1        
 4:        aaaa        lmno     2        
 5:        aaaa        pqrs     2        
 6:        aaaa        tuvw     2        
 7:        aaaa        qyza     2        
 8:        aaaa        bcde     2        
 9:        bbbb       fqhij     3        
10:        cccc        klmn     4        

Desired solution:

       association description group distance
 1:        zzzz        zzzz     1        1
 2:        zzzz        efgh     1        0
 3:        zzzz        hijk     1        0
 4:        aaaa        lmno     2        0
 5:        aaaa        pqrs     2        0
 6:        aaaa        tuvw     2        0
 7:        aaaa        qyza     2        0
 8:        aaaa        bcde     2        0
 9:        bbbb       fqhij     3        0
10:        cccc        klmn     4        0

library(tm)
library(dplyr)
library(data.table)  # provides data.table(), used below

Function to compute the distance

euclidean.dist <- function(x1, x2) {
  sqrt(sum((x1 - x2) ^ 2))
}
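
As a quick sanity check of the function above, here is a minimal sketch that applies it to two small numeric vectors. The input vectors are made up for illustration and are not part of the data in this question.

```r
# Same definition as above, repeated so the snippet runs on its own
euclidean.dist <- function(x1, x2) {
  sqrt(sum((x1 - x2) ^ 2))
}

# 3-4-5 right triangle: the distance from the origin to (3, 4) is 5
d_triangle <- euclidean.dist(c(0, 0), c(3, 4))

# Identical vectors are at distance 0
d_same <- euclidean.dist(c(1, 2, 3), c(1, 2, 3))

print(d_triangle)
print(d_same)
```

Both arguments must be numeric vectors of the same length, which is why the descriptions are first turned into rows of a document-term matrix.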

Data description

association <- c('zzzz', 'zzzz', 'zzzz', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'bbbb', 'cccc')
description <- c('zzzz', 'efgh', 'hijk', 'lmno', 'pqrs', 'tuvw', 'qyza', 'bcde', 'fqhij', 'klmn')
group <- c(1,1,1,2,2,2,2,2,3,4)
distance <- 0

mytable <- data.table(association,description,group,distance)

Index for the for loop

ID <- length(unique(mytable$group))

To explore what happens, for now set:

ID <- 1

The for loop itself

for(i in ID) {

#for each unique group, select only the rows of one group at a time
#Get only the description column
  x1 <- filter(mytable, group == seq_along(unique(mytable$group))[[i]]) %>%
    select(description)

#For the same rows, select the specific association of the group of rows
  x2 <- filter(mytable, group == seq_along(unique(mytable$group))[[i]] & row_number() == 1 | row_number()== n()) %>%
    select(association)

#Rename the association column to description, so as to enable rbind
  x2 <- rename(x2, description = association)
  x3 <- rbind(x2, x1)

#Create distance column to store the values
  x3$distance <- 0

#Transform to a corpus to weight the terms in each doc
  mycorpus <- Corpus(DataframeSource(x3))
  dtm <- DocumentTermMatrix(mycorpus,
                        control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                       stopwords = FALSE))

#Create a matrix for measure
  x4 <- as.matrix(dtm)

#Get all rows, except the first row 
#The first row serves as input to calculate the euclidean for each row
  rows <- (seq(1, nrow(x3) -1) +1 )

#Calculate for all rows the distance
#Leave the first row empty, as it could be removed
  for(a in rows) {
    x3$distance[i] <- euclidean.dist(x4[1,], x4[a,])
  }
}

1 answer:

Answer 0 (score: 0)

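The following replaces the for loop with lapply. I personally prefer the *apply family of functions in R, because it is clear what they return, which is not always the case with a for loop.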
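
To illustrate the difference in how results are kept, here is a minimal sketch on toy data. The vectors `groups` and `values` and the per-group `sum` are placeholders for the real per-group work, not part of the original code.

```r
# Toy data standing in for the per-group work (values are made up)
groups <- c(1, 1, 1, 2, 2, 3)
values <- c(10, 20, 30, 1, 2, 5)
ids <- unique(groups)

# for loop: results must be stored explicitly, here in a preallocated list.
# seq_along(ids) visits every index; `for (i in length(ids))` would run
# only once, with i equal to the last index.
res_for <- vector("list", length(ids))
for (i in seq_along(ids)) {
  res_for[[i]] <- sum(values[groups == ids[i]])
}

# lapply: the value of the function body is collected automatically
res_lapply <- lapply(ids, function(g) sum(values[groups == g]))

# Both approaches produce the same list of per-group results
same <- identical(res_for, res_lapply)
print(same)
```

With the for loop, anything not written into `res_for` is overwritten on the next iteration; with lapply, each iteration's return value becomes one element of the result list.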

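We do essentially the same thing, iterating a sequence of functions over each index from 1 to ID. Note that the for loop inside this sequence has also been changed to lapply.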

lapply(1:ID, function(i) {

  x1 <- filter(mytable, group == seq_along(unique(mytable$group))[[i]]) %>%
    select(description)

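  # Take the association from the first row of the current group only.
  # (The original condition `A & B | C` also matched the last row of the
  # whole table, because & binds more tightly than |.)
  x2 <- filter(mytable, group == seq_along(unique(mytable$group))[[i]]) %>%
    slice(1) %>%
    select(association)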

  x2 <- rename(x2, description = association)
  x3 <- rbind(x2, x1)

  x3$distance <- 0

  mycorpus <- Corpus(DataframeSource(x3))
  dtm <- DocumentTermMatrix(mycorpus,
                            control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                           stopwords = FALSE))

  x4 <- as.matrix(dtm)

  rows <- ( seq(1, nrow(x3) -1) + 1 )

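  # Compute the distance of every row to the first row and store the
  # results in one assignment, instead of using <<- inside lapply
  x3$distance[rows] <- unlist(lapply(rows, function(a) {
    euclidean.dist(x4[1, ], x4[a, ])
  }))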

  x3 %>% mutate(group = i)

}) %>% 
  do.call(what = bind_rows)
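
For comparison, here is a minimal base R sketch of the same split, compute, and combine pattern that runs without tm or dplyr. It is an assumption-laden illustration rather than a replacement for the TF-IDF weighting above: `letter_counts` is a hypothetical helper that represents each string by its counts of the letters a to z, so the distances it produces differ from those of the document-term matrix.

```r
euclidean.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))

# Hypothetical helper (not in the original code): counts of a-z in one string
letter_counts <- function(s) {
  chars <- strsplit(s, "")[[1]]
  as.numeric(table(factor(chars, levels = letters)))
}

mytable <- data.frame(
  association = c('zzzz', 'zzzz', 'zzzz', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'bbbb', 'cccc'),
  description = c('zzzz', 'efgh', 'hijk', 'lmno', 'pqrs', 'tuvw', 'qyza', 'bcde', 'fqhij', 'klmn'),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Split by group, compute one distance per row, return the augmented group
per_group <- lapply(split(mytable, mytable$group), function(g) {
  ref <- letter_counts(g$association[1])
  g$distance <- vapply(g$description,
                       function(d) euclidean.dist(ref, letter_counts(d)),
                       numeric(1), USE.NAMES = FALSE)
  g
})

# Combine the per-group results back into one table
result <- do.call(rbind, per_group)
rownames(result) <- NULL
print(result)
```

Because each call of the anonymous function returns its own data frame, nothing has to be written to an outer variable, and vapply guarantees that every row yields exactly one numeric distance.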